The Critical Influence of Air Pollution and Socioeconomic Status on Cardiovascular Disease Mortality Rates in the U.S. with Public Health and Social Justice Implications¶

Bayowa Onabajo¶

Department of Applied Data Science and Analytics, Howard University¶

April 22, 2025¶

Introduction¶

Cardiovascular disease (CVD) continues to be one of the leading causes of death in the United States. While its biomedical causes are well documented, this paper expands the discourse by examining the intersection between air pollution, particularly fine particulate matter (PM2.5), and socioeconomic status (SES). Disparities in air pollution exposure remain a public health and social justice issue, especially for densely populated communities. PM2.5, a pollutant linked to industrial and other human activity, has been shown to elevate the risk of cardiovascular disease mortality, especially among populations in economically disadvantaged communities (Crouse et al., 2012; Di et al., 2017). Using a cross-sectional approach, our research investigates whether there is a statistical relationship between cardiovascular disease mortality rates (CMR) and long-term PM2.5 exposure, and whether socioeconomic status influences this relationship.

Why Cardiovascular Disease is Significant.¶

Cardiovascular disease (CVD) is a class of diseases that affect the heart or blood vessels. These conditions include, but are not limited to, coronary artery disease, stroke, heart failure, and hypertension (which is more often treated as a risk factor). CVD is a critical public health concern because of its high prevalence and substantial impact on morbidity and mortality, and it contributes significantly to healthcare costs and reduced quality of life. Understanding its determinants is essential for developing effective prevention and intervention strategies and for building healthier communities.

Air Pollution and Particulate Matter 2.5 (PM2.5).¶

Air pollution, particularly fine particulate matter (PM2.5), has emerged as a significant environmental risk factor for CVD. PM2.5 refers to airborne particles that are 2.5 micrometers in diameter or less. These particles can be inhaled deep into the bronchi, bronchioles, and alveoli of the lungs, where they can enter the bloodstream and trigger a cascade of adverse physiological responses. The vascular impacts are also documented: PM2.5 exposure is associated with increased inflammation, oxidative stress, endothelial dysfunction, and altered blood coagulation (Krittanawong et al., 2023). These processes contribute to the development and progression of atherosclerosis, hypertension, and CVD.

Socioeconomic Status: Factors and Importance¶

Socioeconomic status (SES) is a multifaceted social construct encompassing socioeconomic factors that significantly influence individuals and communities. Key indicators of SES include income, which affects access to essential resources such as healthcare, healthy food, and housing; education, which shapes health literacy, employment prospects, and health-promoting behaviors; and healthcare access, which determines the availability and quality of medical services for disease prevention, diagnosis, and treatment. Notably, lower socioeconomic status is frequently associated with increased exposure to risk factors for cardiovascular disease (Cox et al., 2018).

Public Health and Social Justice Implications¶

The confluence of elevated PM2.5 levels and low socioeconomic status (SES) carries significant implications for both public health and social justice (Ma et al., 2023). Communities characterized by lower SES frequently experience a disproportionate burden of cardiovascular disease (CVD). This disparity may be related to increased exposure to environmental pollutants coupled with diminished access to resources that could otherwise mitigate adverse health effects. This inequitable distribution represents a critical environmental injustice in which marginalized populations are subjected to elevated health risks. Consequently, addressing CVD effectively requires a holistic approach that integrates both biomedical and socio-environmental determinants. Interventions should be designed to achieve a dual objective: reducing overall pollution levels and mitigating existing socioeconomic disparities to foster health equity.

This study analyzes data from 2,132 U.S. counties, using a cross-sectional approach to identify how geography, poverty, and pollution converge to produce avoidable, unequal mortality outcomes. This paper contributes to the growing body of research emphasizing the need for social justice policies that protect vulnerable populations and address health disparities driven by structural inequality.

Framework:¶

The study adopts a multifactorial framework based on fundamental cause theory (Phelan et al., 2010), emphasizing how environmental and social stressors interact in ways that intensify harm beyond their individual effects.

Research questions:¶

What is the association between air pollution (PM2.5), socioeconomic factors (poverty, education, and health insurance), and cardiovascular mortality rates in the U.S.?

How does the hypertension rate influence cardiovascular mortality rates in the U.S.?

Problem statement:¶

Cardiovascular disease (CVD) is a leading cause of death in the United States, and growing evidence suggests that air pollution exposure, measured as fine particulate matter (PM2.5), influences cardiovascular morbidity and mortality and disproportionately affects low-income populations. Individuals from lower socioeconomic backgrounds are more likely to live in areas with higher pollution levels, overcrowding, limited healthcare access, and economic stressors that contribute to CVD risk factors such as hypertension. These inequalities raise concerns about how socioeconomic and environmental conditions intersect in shaping public health outcomes. To what degree do air pollution and socioeconomic status influence cardiovascular mortality rates in disadvantaged populations?

Data Definition¶

American Community Survey (2009,2010): 1-Year Estimates.¶

Last Updated: January 25, 2024. https://www.census.gov/data/developers/data-sets/acs-1year/2009.html https://www.census.gov/data/developers/data-sets/acs-1year/2010.html These datasets consist of more than 48,000 variables from the American Community Survey (ACS), which publishes data annually. They cover broad social, housing, economic, and demographic variables for the nation and for all U.S. states. The data are presented as counts. Variables from the ACS 1-year (ACS1) datasets were used in this paper because they suit the statistical approach and can be matched to the other datasets.

PM2.5 and cardiovascular mortality rate.¶

Last Updated: November 12, 2020 https://catalog.data.gov/dataset/annual-pm2-5-and-cardiovascular-mortality-rate-data-trends-modified-by-county-socioeconomi The dataset, provided by the U.S. Environmental Protection Agency, comprises socioeconomic status information for 2,132 counties across the United States in the form of indexes and quintiles. It also includes average annual cardiovascular mortality rates and total PM2.5 concentrations for each county over a 21-year span (1990–2010). The cardiovascular mortality data were collected from the U.S. National Center for Health Statistics, while PM2.5 levels were estimated using the EPA’s Community Multiscale Air Quality (CMAQ) modeling system. The socioeconomic data were extracted from the U.S. Census Bureau.

Heart Disease Mortality by State.¶

Last Updated: February 25, 2022 https://www.cdc.gov/nchs/pressroom/sosmap/heart_disease_mortality/heart_disease.htm The dataset shows the number of deaths per 100,000 population attributed to heart disease in U.S. states with variables like death rate and number of deaths. It also adjusts for differences in age distribution and population size.

Hypertension Mortality by State¶

Last Updated: March 3, 2022 https://www.cdc.gov/nchs/pressroom/sosmap/hypertension_mortality/hypertension.htm The dataset shows the number of deaths per 100,000 population attributed to hypertension in U.S. states with variables like death rate and number of deaths. It also adjusts for differences in age distribution and population size.
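As a rough illustration of what "deaths per 100,000" and age adjustment mean for the two mortality datasets above, the sketch below computes a crude rate and a directly standardized rate. The age bands, counts, and standard-population weights are hypothetical values chosen for illustration; they are not taken from the CDC tables, and the function names are assumptions.

```python
def crude_rate(deaths, population, per=100_000):
    """Deaths per `per` people, with no adjustment for age structure."""
    return deaths / population * per


def age_adjusted_rate(deaths_by_age, pop_by_age, std_weights, per=100_000):
    """Direct standardization: weight each age-specific rate by the share of
    that age band in a standard population (weights must sum to 1)."""
    if abs(sum(std_weights) - 1.0) > 1e-9:
        raise ValueError("standard-population weights must sum to 1")
    return sum(
        w * crude_rate(d, p, per)
        for d, p, w in zip(deaths_by_age, pop_by_age, std_weights)
    )


# Hypothetical three-band example (under 45, 45-64, 65 and over)
deaths = [10, 90, 400]
population = [60_000, 30_000, 10_000]
weights = [0.6, 0.25, 0.15]  # assumed standard-population shares

print(round(crude_rate(sum(deaths), sum(population)), 1))
print(round(age_adjusted_rate(deaths, population, weights), 1))
```

Two places with the same crude rate can have different adjusted rates when their age structures differ, which is why adjusted rates are the safer basis for comparing states.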

In [6]:
# Import libraries
import numpy as np                  # Scientific Computing
import pandas as pd                 # Data Analysis
import matplotlib.pyplot as plt     # Plotting
import seaborn as sns               # Statistical Data Visualization

# pandas returns all the rows and columns for the dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Force pandas to display full numbers instead of scientific notation
# pd.options.display.float_format = '{:.0f}'.format

# Library to suppress warnings
import warnings
warnings.filterwarnings('ignore')
In [7]:
# Read the dataset directly into a DataFrame (pd.read_csv already returns one)
df_annualcounty_pm25_cmr = pd.read_csv('/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_annual_PM25_CMR.csv')
In [8]:
# Read the dataset directly into a DataFrame (pd.read_csv already returns one)
df_county_sespm25_index_quintile = pd.read_csv('/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_SES_index_quintile.csv')
In [9]:
# Read the dataset directly into a DataFrame (pd.read_csv already returns one)
df_heart_dx_mort = pd.read_csv('/Users/bayowaonabajo/Downloads/data-table-heart-dx-mort.csv')
In [10]:
# Read the dataset directly into a DataFrame (pd.read_csv already returns one)
df_htn_dx_mort = pd.read_csv('/Users/bayowaonabajo/Downloads/data-table-htn-dx-mort.csv')
In [11]:
# Read the dataset
path = pd.read_csv('/Users/bayowaonabajo/Downloads/acs_vars_2009_2010_states.csv')

Methodology¶

Data for this study were drawn from publicly available national sources and harmonized across a cross-sectional frame (2009–2010). The data were cleaned and features were engineered to support analysis and visualization. Statistical and visual analyses are presented with explanations of the key findings.
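The harmonization step can be sketched as a join on state and year. The toy frames below reuse column names that appear in this notebook (`state`, `Year`, `PM2.5`, `CMR`, `year`, `poverty_rate`), but every value is a hypothetical stand-in, and averaging counties within each state before joining is an assumption rather than a description of the procedure used in the analysis.

```python
import pandas as pd

# Hypothetical county-level rows (one row per county-year)
county = pd.DataFrame({
    "state": ["AL", "AL", "WY", "WY"],
    "Year": [2009, 2010, 2009, 2010],
    "PM2.5": [9.0, 8.8, 3.1, 3.2],
    "CMR": [300.0, 290.0, 242.8, 254.9],
})

# Hypothetical state-level ACS rows (one row per state-year)
acs = pd.DataFrame({
    "state": ["AL", "AL", "WY", "WY"],
    "year": [2009, 2010, 2009, 2010],
    "poverty_rate": [17.5, 19.0, 9.8, 11.2],
})

# Aggregate counties to the state-year level so both frames share a grain
state_year = county.groupby(["state", "Year"], as_index=False)[["PM2.5", "CMR"]].mean()

# Join on state and year; validate guards against accidental row duplication
merged = state_year.merge(
    acs,
    left_on=["state", "Year"],
    right_on=["state", "year"],
    how="inner",
    validate="one_to_one",
).drop(columns="year")

print(merged)
```

An inner join keeps only state-years present in both sources, so comparing `len(merged)` with `len(state_year)` is a quick way to spot unmatched rows.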

Data Cleaning and Preparation¶

In [13]:
df_annualcounty_pm25_cmr.head()
Out[13]:
Unnamed: 0 FIPS Year PM2.5 CMR fip_state state
0 1 1001 1990 9.749792 471.758888 1 AL
1 2 1001 1991 9.069443 456.869651 1 AL
2 3 1001 1992 9.105352 520.014377 1 AL
3 4 1001 1993 8.752873 454.436425 1 AL
4 5 1001 1994 9.024049 415.035332 1 AL
In [14]:
# Load the dataset
df = pd.read_csv('/Users/bayowaonabajo/Downloads/acs_vars_2009_2010_states.csv')

# State abbreviations mapping
state_abbreviations = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
    'Puerto Rico': 'PR'
}

# Replace state names with abbreviations
df['state'] = df['state'].map(state_abbreviations)



# Save the updated dataset to a new variable
df_acs_2009_2010_states = df

# Rename columns
df_acs_2009_2010_states = df_acs_2009_2010_states.rename(columns={'state.1': 'fip'})


df_acs_2009_2010_states.head()

 
Out[14]:
state median_income total_population_poverty poverty_count total_population_uninsured uninsured_count total_population_education_18 high_school_diploma ged_alternative associates_degree bachelors_degree masters_degree professional_degree doctorate_degree fip poverty_rate uninsured_rate educated_adults education_percent_educated_18 year
0 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009
1 AK 66953 682412 61653 678081 24993 431178 4388 68535 15906 34369 13071 3876 2806 2 9.034571 3.685843 142951 33.153593 2009
2 AZ 48745 6475485 1069897 6501531 207853 4248231 46247 513087 150479 348081 135252 41173 29019 4 16.522268 3.196985 1263338 29.737978 2009
3 AR 37823 2806056 527378 2833391 44061 1903914 18213 324262 41334 114200 33797 13430 7963 5 18.794279 1.555062 553199 29.055882 2009
4 CA 58931 36202780 5128708 36376938 890998 23782109 308968 2474351 820990 2220258 830392 306369 210817 6 14.166614 2.449349 7172145 30.157733 2009

Block for extracting and merging the ACS variables needed:

import censusdata
import requests
import pandas as pd

censusdata.census_api_key = "YOURAPIKEY"  # API key placeholder

# Define API endpoint and parameters
base_url = "https://api.census.gov/data/{year}/acs/acs1"
variables = "NAME,B19013_001E,B17001_001E,B17001_002E,B27010_001E,B27010_017E,B15002_001E,B15002_010E,B15002_011E,B15002_014E,B15002_015E,B15002_016E,B15002_017E,B15002_018E"
state_code = "*"  # Fetch data for all states

# Store dataframes in a list
all_dfs = []

# Loop through the years 2009 and 2010
for year in [2009, 2010]:
    # Construct the API request URL, inserting the current year
    url = f"{base_url.format(year=year)}?get={variables}&for=state:{state_code}&key={censusdata.census_api_key}"

    # Make the API request
    response = requests.get(url)

    # Check if successful
    if response.status_code == 200:
        print(f"Data fetched for {year}!")
        data = response.json()  # Parse the JSON response

        header = data[0]  # First row contains column names
        rows = data[1:]  # Remaining rows contain the data
        df_acs = pd.DataFrame(rows, columns=header)

        # Rename columns for clarity
        df_acs = df_acs.rename(columns={
            "NAME": "state",
            "B19013_001E": "median_income",
            "B17001_001E": "total_population_poverty",
            "B17001_002E": "poverty_count",
            "B27010_001E": "total_population_uninsured",
            "B27010_017E": "uninsured_count",
            "B15002_001E": "total_population_education_18",
            "B15002_010E": "high_school_diploma",
            "B15002_011E": "ged_alternative",
            "B15002_014E": "associates_degree",
            "B15002_015E": "bachelors_degree",
            "B15002_016E": "masters_degree",
            "B15002_017E": "professional_degree",
            "B15002_018E": "doctorate_degree"
        })

        # Convert numeric columns to appropriate data types
        numeric_columns = ["median_income", "total_population_poverty", "poverty_count",
                           "total_population_uninsured", "uninsured_count",
                           "total_population_education_18", "high_school_diploma",
                           "ged_alternative", "associates_degree", "bachelors_degree",
                           "masters_degree", "professional_degree", "doctorate_degree"]
        df_acs[numeric_columns] = df_acs[numeric_columns].apply(pd.to_numeric, errors="coerce")

        # Calculate percentages
        df_acs["poverty_rate"] = (df_acs["poverty_count"] / df_acs["total_population_poverty"]) * 100
        df_acs["uninsured_rate"] = (df_acs["uninsured_count"] / df_acs["total_population_uninsured"]) * 100

        # Calculate educated adults
        df_acs["educated_adults"] = df_acs["high_school_diploma"] + df_acs["ged_alternative"] + \
                                    df_acs["associates_degree"] + df_acs["bachelors_degree"] + \
                                    df_acs["masters_degree"] + df_acs["professional_degree"] + \
                                    df_acs["doctorate_degree"]

        df_acs["education_percent_educated_18"] = (df_acs["educated_adults"] / df_acs["total_population_education_18"]) * 100

        df_acs['year'] = year  # Add the year
        all_dfs.append(df_acs)  # Append to the list
    else:
        print(f"Error for {year}: {response.status_code}")
        print(response.text)
        continue  # Skip the current year and move to the next

if not all_dfs:
    print("Warning: No data was able to be collected.")
else:
    df_acs_vars_09_10_states = pd.concat(all_dfs, ignore_index=True)
    df_acs_vars_09_10_states
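As a quick sanity check on the rate calculations, the sketch below recomputes the poverty rate for the 2009 Alabama row shown in Out[14]. The counts are copied from that output; the helper name `percent_of` is a hypothetical convenience, not part of the original notebook.

```python
import pandas as pd


def percent_of(count, total):
    """Return count as a percentage of total (works on scalars or Series)."""
    return count / total * 100


# 2009 Alabama counts as displayed in Out[14]
row = pd.DataFrame({
    "state": ["AL"],
    "poverty_count": [804683],
    "total_population_poverty": [4588899],
})

row["poverty_rate"] = percent_of(row["poverty_count"], row["total_population_poverty"])
print(row["poverty_rate"].round(6).iloc[0])
```

The result should agree with the `poverty_rate` column in Out[14] to the displayed precision, which confirms that the rate is a simple count-over-universe percentage.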
In [16]:
import pandas as pd

# Load the dataset
Ses_pm25_cmr_data = '/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_annual_PM25_CMR.csv'

df2 = pd.read_csv(Ses_pm25_cmr_data, dtype={'FIPS': str})  

# State FIPS to state abbreviation extracted from FIPS in original ses_pm25_cmr file encoded as two-digit State FIPS code and three-digit county code
state_fips_mapping = {
    '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA', '08': 'CO', '09': 'CT',
    '10': 'DE', '11': 'DC', '12': 'FL', '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL',
    '18': 'IN', '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME', '24': 'MD',
    '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS', '29': 'MO', '30': 'MT', '31': 'NE',
    '32': 'NV', '33': 'NH', '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
    '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI', '45': 'SC', '46': 'SD',
    '47': 'TN', '48': 'TX', '49': 'UT', '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV',
    '55': 'WI', '56': 'WY'
}

# Extract state FIPS and map to abbreviations
def extract_state_info(df):
    df['fip_state'] = df['FIPS'].str[:2]  # Extract first two digits
    df['state'] = df['fip_state'].map(state_fips_mapping)
    return df

df2 = extract_state_info(df2)
df2.head()

# update dataset with fip state codes and states
updated_file = '/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_annual_PM25_CMR.csv'
df2.to_csv(updated_file, index=False)

# Display few rows
df2.head()
Out[16]:
Unnamed: 0 FIPS Year PM2.5 CMR fip_state state
0 1 01001 1990 9.749792 471.758888 01 AL
1 2 01001 1991 9.069443 456.869651 01 AL
2 3 01001 1992 9.105352 520.014377 01 AL
3 4 01001 1993 8.752873 454.436425 01 AL
4 5 01001 1994 9.024049 415.035332 01 AL
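Reading `FIPS` as a string matters because state codes below 10 carry a leading zero. When the column has already been parsed as an integer (for example, the value 1001 rather than 01001), one way to recover the two-digit state code is to zero-pad to five characters before slicing. The function name `state_fips_from_county` is an illustrative assumption.

```python
import pandas as pd


def state_fips_from_county(fips):
    """Return the two-digit state FIPS code from a county FIPS Series,
    whether the values were read as int (1001) or as str ('01001')."""
    padded = fips.astype(str).str.zfill(5)  # restore any dropped leading zero
    return padded.str[:2]


fips = pd.Series([1001, 56037, "01003"])
print(state_fips_from_county(fips).tolist())
```

Normalizing the code this way keeps the state mapping stable even when a CSV is reloaded without `dtype={'FIPS': str}`.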
In [17]:
import pandas as pd

# Load the dataset
Ses_index_quintile_file = '/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_SES_index_quintile.csv'

df1 = pd.read_csv(Ses_index_quintile_file, dtype={'FIPS': str})  

# State FIPS to state abbreviation extracted from FIPS in original ses_index_quintile file encoded as two-digit State FIPS code and three-digit county code
state_fips_mapping = {
    '01': 'AL', '02': 'AK', '04': 'AZ', '05': 'AR', '06': 'CA', '08': 'CO', '09': 'CT',
    '10': 'DE', '11': 'DC', '12': 'FL', '13': 'GA', '15': 'HI', '16': 'ID', '17': 'IL',
    '18': 'IN', '19': 'IA', '20': 'KS', '21': 'KY', '22': 'LA', '23': 'ME', '24': 'MD',
    '25': 'MA', '26': 'MI', '27': 'MN', '28': 'MS', '29': 'MO', '30': 'MT', '31': 'NE',
    '32': 'NV', '33': 'NH', '34': 'NJ', '35': 'NM', '36': 'NY', '37': 'NC', '38': 'ND',
    '39': 'OH', '40': 'OK', '41': 'OR', '42': 'PA', '44': 'RI', '45': 'SC', '46': 'SD',
    '47': 'TN', '48': 'TX', '49': 'UT', '50': 'VT', '51': 'VA', '53': 'WA', '54': 'WV',
    '55': 'WI', '56': 'WY'
}

# Extract state FIPS and map to abbreviations
def extract_state_info(df):
    df['fip_state'] = df['FIPS'].str[:2]  # Extract first two digits
    df['state'] = df['fip_state'].map(state_fips_mapping)
    return df

df1 = extract_state_info(df1)
df1.head()

# update dataset with fip state codes and states
updated_file = '/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_SES_index_quintile.csv'
df1.to_csv(updated_file, index=False)

df1

# Display few rows
df1.head()
Out[17]:
Unnamed: 0 FIPS SES_index_1990 SES_index_2000 SES_index_2010 SES_quintile_1990 SES_quintile_2000 SES_quintile_2010 fip_state state
0 1 01001 -0.079387 -0.322846 -0.405150 Q3 Q3 Q2 01 AL
1 2 01003 -0.187240 -0.467794 -0.403987 Q3 Q2 Q2 01 AL
2 3 01005 1.279538 2.013751 1.740142 Q5 Q5 Q5 01 AL
3 4 01009 0.124421 -0.375181 -0.405849 Q4 Q3 Q2 01 AL
4 5 01011 2.877256 3.519681 2.617074 Q5 Q5 Q5 01 AL
In [18]:
# Display the first five rows of the dataframe
df_annualcounty_pm25_cmr.head()
Out[18]:
Unnamed: 0 FIPS Year PM2.5 CMR fip_state state
0 1 1001 1990 9.749792 471.758888 1 AL
1 2 1001 1991 9.069443 456.869651 1 AL
2 3 1001 1992 9.105352 520.014377 1 AL
3 4 1001 1993 8.752873 454.436425 1 AL
4 5 1001 1994 9.024049 415.035332 1 AL
In [19]:
# Display the last five rows of the dataframe
df_annualcounty_pm25_cmr.tail(5)
Out[19]:
Unnamed: 0 FIPS Year PM2.5 CMR fip_state state
44767 44768 56037 2006 3.776910 247.510138 56 WY
44768 44769 56037 2007 3.609803 292.450269 56 WY
44769 44770 56037 2008 3.297100 182.189745 56 WY
44770 44771 56037 2009 3.119896 242.828987 56 WY
44771 44772 56037 2010 3.230996 254.860863 56 WY
In [20]:
# Reload the SES file after the state columns were added
df_county_sespm25_index_quintile = pd.read_csv('/Users/bayowaonabajo/Downloads/SES_PM25_CMR_data-2/County_SES_index_quintile.csv')
In [21]:
df_county_sespm25_index_quintile.head()
Out[21]:
Unnamed: 0 FIPS SES_index_1990 SES_index_2000 SES_index_2010 SES_quintile_1990 SES_quintile_2000 SES_quintile_2010 fip_state state
0 1 1001 -0.079387 -0.322846 -0.405150 Q3 Q3 Q2 1 AL
1 2 1003 -0.187240 -0.467794 -0.403987 Q3 Q2 Q2 1 AL
2 3 1005 1.279538 2.013751 1.740142 Q5 Q5 Q5 1 AL
3 4 1009 0.124421 -0.375181 -0.405849 Q4 Q3 Q2 1 AL
4 5 1011 2.877256 3.519681 2.617074 Q5 Q5 Q5 1 AL
In [22]:
#df_county_sespm25_index_quintile.tail()
In [23]:
df_heart_dx_mort.head()
Out[23]:
YEAR STATE RATE DEATHS URL
0 2022 AL 234.2 14958 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 145.7 1013 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 148.5 14593 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 224.1 8664 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 142.4 66340 /nchs/pressroom/states/california/ca.htm
In [24]:
#df_heart_dx_mort.tail()
In [25]:
df_htn_dx_mort['YEAR'].unique()
Out[25]:
array([2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2005])
In [26]:
df_htn_dx_mort.head(5)
Out[26]:
YEAR STATE RATE DEATHS URL
0 2022 AL 13.2 849 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 8.6 56 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 11.3 1109 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 12.1 454 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 14.4 6727 /nchs/pressroom/states/california/ca.htm
In [27]:
df_heart_dx_mort['YEAR'].unique()
Out[27]:
array([2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2005])
In [28]:
df_heart_dx_mort.head()
Out[28]:
YEAR STATE RATE DEATHS URL
0 2022 AL 234.2 14958 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 145.7 1013 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 148.5 14593 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 224.1 8664 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 142.4 66340 /nchs/pressroom/states/california/ca.htm
In [29]:
#df_htn_dx_mort.tail()
In [30]:
# Display the first five rows of the dataframe
df_acs_2009_2010_states.head()
Out[30]:
state median_income total_population_poverty poverty_count total_population_uninsured uninsured_count total_population_education_18 high_school_diploma ged_alternative associates_degree bachelors_degree masters_degree professional_degree doctorate_degree fip poverty_rate uninsured_rate educated_adults education_percent_educated_18 year
0 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009
1 AK 66953 682412 61653 678081 24993 431178 4388 68535 15906 34369 13071 3876 2806 2 9.034571 3.685843 142951 33.153593 2009
2 AZ 48745 6475485 1069897 6501531 207853 4248231 46247 513087 150479 348081 135252 41173 29019 4 16.522268 3.196985 1263338 29.737978 2009
3 AR 37823 2806056 527378 2833391 44061 1903914 18213 324262 41334 114200 33797 13430 7963 5 18.794279 1.555062 553199 29.055882 2009
4 CA 58931 36202780 5128708 36376938 890998 23782109 308968 2474351 820990 2220258 830392 306369 210817 6 14.166614 2.449349 7172145 30.157733 2009
In [31]:
# Display the last five rows of the dataframe
#df_acs_2009_2010_states.tail()
In [32]:
df_annualcounty_pm25_cmr.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44772 entries, 0 to 44771
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  44772 non-null  int64  
 1   FIPS        44772 non-null  int64  
 2   Year        44772 non-null  int64  
 3   PM2.5       44772 non-null  float64
 4   CMR         44772 non-null  float64
 5   fip_state   44772 non-null  int64  
 6   state       44772 non-null  object 
dtypes: float64(2), int64(4), object(1)
memory usage: 2.4+ MB
In [33]:
# This is the number of rows and columns in the data
df_annualcounty_pm25_cmr.shape
Out[33]:
(44772, 7)

The dataframe has 44772 rows and 7 columns. The total number of datapoints expected is 313404
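The expected number of datapoints is simply rows multiplied by columns, which pandas also exposes directly as `DataFrame.size`. A minimal check on a small stand-in frame (the frame `demo` is hypothetical):

```python
import pandas as pd

# Small stand-in frame: 3 rows, 2 columns
demo = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = demo.shape
assert demo.size == n_rows * n_cols  # size counts every cell, including NaN
print(n_rows, n_cols, demo.size)

# The same arithmetic for the shape reported above
print(44772 * 7)
```

Comparing `df.size` with `df.count().sum()` is one way to see how many of those cells are actually populated.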

In [35]:
df_county_sespm25_index_quintile.shape
Out[35]:
(2132, 10)

The dataframe has 2132 rows and 10 columns. The total number of datapoints expected is 21320

In [37]:
df_county_sespm25_index_quintile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2132 entries, 0 to 2131
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2132 non-null   int64  
 1   FIPS               2132 non-null   int64  
 2   SES_index_1990     2132 non-null   float64
 3   SES_index_2000     2132 non-null   float64
 4   SES_index_2010     2132 non-null   float64
 5   SES_quintile_1990  2132 non-null   object 
 6   SES_quintile_2000  2132 non-null   object 
 7   SES_quintile_2010  2132 non-null   object 
 8   fip_state          2132 non-null   int64  
 9   state              2132 non-null   object 
dtypes: float64(3), int64(3), object(4)
memory usage: 166.7+ KB
In [38]:
df_heart_dx_mort.shape
Out[38]:
(501, 5)

The dataframe has 501 rows and 5 columns. The total number of datapoints expected is 2505

In [40]:
df_heart_dx_mort.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 501 entries, 0 to 500
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   YEAR    501 non-null    int64  
 1   STATE   501 non-null    object 
 2   RATE    501 non-null    float64
 3   DEATHS  501 non-null    object 
 4   URL     501 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 19.7+ KB
In [41]:
df_htn_dx_mort.shape
Out[41]:
(501, 5)

The dataframe has 501 rows and 5 columns. The total number of datapoints expected is 2505

In [43]:
df_htn_dx_mort.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 501 entries, 0 to 500
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   YEAR    501 non-null    int64  
 1   STATE   501 non-null    object 
 2   RATE    501 non-null    float64
 3   DEATHS  501 non-null    object 
 4   URL     501 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 19.7+ KB
In [44]:
df_acs_2009_2010_states.shape
Out[44]:
(104, 20)

The dataframe has 104 rows and 20 columns. The total number of datapoints expected is 2080

In [46]:
df_acs_2009_2010_states.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104 entries, 0 to 103
Data columns (total 20 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   state                          104 non-null    object 
 1   median_income                  104 non-null    int64  
 2   total_population_poverty       104 non-null    int64  
 3   poverty_count                  104 non-null    int64  
 4   total_population_uninsured     104 non-null    int64  
 5   uninsured_count                104 non-null    int64  
 6   total_population_education_18  104 non-null    int64  
 7   high_school_diploma            104 non-null    int64  
 8   ged_alternative                104 non-null    int64  
 9   associates_degree              104 non-null    int64  
 10  bachelors_degree               104 non-null    int64  
 11  masters_degree                 104 non-null    int64  
 12  professional_degree            104 non-null    int64  
 13  doctorate_degree               104 non-null    int64  
 14  fip                            104 non-null    int64  
 15  poverty_rate                   104 non-null    float64
 16  uninsured_rate                 104 non-null    float64
 17  educated_adults                104 non-null    int64  
 18  education_percent_educated_18  104 non-null    float64
 19  year                           104 non-null    int64  
dtypes: float64(3), int64(16), object(1)
memory usage: 16.4+ KB
In [47]:
df_annualcounty_pm25_cmr['state'].unique()
Out[47]:
array(['AL', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'ID',
       'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
       'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND',
       'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT',
       'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
In [48]:
#create a list of the columns in the dataset
df_annualcounty_pm25_cmrCol = df_annualcounty_pm25_cmr.columns 
df_annualcounty_pm25_cmrCol
Out[48]:
Index(['Unnamed: 0', 'FIPS', 'Year', 'PM2.5', 'CMR', 'fip_state', 'state'], dtype='object')
In [49]:
# Update the headers for consistency: rename the unnamed index column
df_annualcounty_pm25_cmr = df_annualcounty_pm25_cmr.rename(columns={'Unnamed: 0': 'indexes'})

df_annualcounty_pm25_cmr.head()
Out[49]:
indexes FIPS Year PM2.5 CMR fip_state state
0 1 1001 1990 9.749792 471.758888 1 AL
1 2 1001 1991 9.069443 456.869651 1 AL
2 3 1001 1992 9.105352 520.014377 1 AL
3 4 1001 1993 8.752873 454.436425 1 AL
4 5 1001 1994 9.024049 415.035332 1 AL

Renamed the column "Unnamed: 0" to "indexes" so that the column name describes its contents.

In [51]:
# Keep only 2009 and 2010 so the years match the ACS 1-year estimates
df_annualcounty_pm25_cmr_filtered = df_annualcounty_pm25_cmr[df_annualcounty_pm25_cmr['Year'].isin([2009, 2010])]
In [52]:
df_annualcounty_pm25_cmr_filtered.tail()
Out[52]:
indexes FIPS Year PM2.5 CMR fip_state state
44729 44730 56029 2010 2.571525 170.765285 56 WY
44749 44750 56033 2009 2.566431 235.312525 56 WY
44750 44751 56033 2010 2.642380 175.671813 56 WY
44770 44771 56037 2009 3.119896 242.828987 56 WY
44771 44772 56037 2010 3.230996 254.860863 56 WY

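Dropped rows for the years 1990 to 2008 so that the timeline matches the ACS 2009 and 2010 datasets. The filter did not change state coverage: the dataset lists the same 49 state-level units before and after filtering (48 states plus the District of Columbia), because Alaska and Hawaii are absent from the source data.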
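Whether a filter changes state coverage can be checked by comparing the set of states before and after. The sketch below uses a small hypothetical frame that borrows the notebook's `Year` and `state` column names.

```python
import pandas as pd

# Hypothetical rows: AZ only appears in a year that the filter removes
df = pd.DataFrame({
    "state": ["AL", "AL", "WY", "WY", "AZ"],
    "Year": [2008, 2009, 2009, 2010, 1995],
})

filtered = df[df["Year"].isin([2009, 2010])]

before = set(df["state"])
after = set(filtered["state"])

print(len(before), len(after))
print(sorted(before - after))  # states lost by the filter, if any
```

An empty difference means the filter only removed years, not places.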

In [54]:
df_annualstate_county_pm25_cmr = df_annualcounty_pm25_cmr_filtered

df_annualstate_county_pm25_cmr['state'].unique()
Out[54]:
array(['AL', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FL', 'GA', 'ID',
       'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN',
       'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND',
       'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT',
       'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)
In [55]:
# Determine the number of missing values
df_annualstate_county_pm25_cmr.isnull().sum()
Out[55]:
indexes      0
FIPS         0
Year         0
PM2.5        0
CMR          0
fip_state    0
state        0
dtype: int64
In [56]:
# Determine the percentage of missing values
# Typically less than five percent missing values may not affect the results
# More than 5% can be dropped, replaced with existing data, or imputed using mean or median.

def missing(Dataframe):
    print('Percentage of missing values in the dataset:\n',
          round((Dataframe.isnull().sum() *100/len(Dataframe)), 2).sort_values(ascending=False))
    

missing(df_annualstate_county_pm25_cmr)
Percentage of missing values in the dataset:
 indexes      0.0
FIPS         0.0
Year         0.0
PM2.5        0.0
CMR          0.0
fip_state    0.0
state        0.0
dtype: float64

I have no missing values in this dataset, which is good for my analysis because no rows need to be dropped or imputed before statistical analysis, exploration, and visualization.
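
Since the same check is repeated for every dataset, one option is a helper that returns the percentages instead of printing them, so the result can be reused or compared against the 5% threshold. A minimal sketch, assuming a hypothetical helper name 'missing_percent' and toy data that is not part of the study:

```python
import numpy as np
import pandas as pd

def missing_percent(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (frame.isnull().mean() * 100).round(2).sort_values(ascending=False)

# Toy frame with one missing PM2.5 value (illustrative values only)
toy = pd.DataFrame({
    "PM2.5": [6.4, np.nan, 5.8, 7.1],
    "CMR": [330.9, 316.9, 270.4, 276.4],
})

pct = missing_percent(toy)
# Columns above the 5% rule of thumb mentioned in the comments
needs_attention = pct[pct > 5].index.tolist()
print(pct)
print(needs_attention)
```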

In [58]:
# create a list of the columns in the dataset
df_county_sespm25_index_quintileCol = df_county_sespm25_index_quintile.columns 
df_county_sespm25_index_quintileCol
Out[58]:
Index(['Unnamed: 0', 'FIPS', 'SES_index_1990', 'SES_index_2000',
       'SES_index_2010', 'SES_quintile_1990', 'SES_quintile_2000',
       'SES_quintile_2010', 'fip_state', 'state'],
      dtype='object')
In [59]:
# Update the Headers for Syntax Consistency

df_county_sespm25_index_quintileCol = df_county_sespm25_index_quintile.rename(columns = {'Unnamed: 0':'indexes'})

# view the new columns and update the variable

df_county_sespm25_index_quintile = df_county_sespm25_index_quintileCol

df_county_sespm25_index_quintile.head()
Out[59]:
indexes FIPS SES_index_1990 SES_index_2000 SES_index_2010 SES_quintile_1990 SES_quintile_2000 SES_quintile_2010 fip_state state
0 1 1001 -0.079387 -0.322846 -0.405150 Q3 Q3 Q2 1 AL
1 2 1003 -0.187240 -0.467794 -0.403987 Q3 Q2 Q2 1 AL
2 3 1005 1.279538 2.013751 1.740142 Q5 Q5 Q5 1 AL
3 4 1009 0.124421 -0.375181 -0.405849 Q4 Q3 Q2 1 AL
4 5 1011 2.877256 3.519681 2.617074 Q5 Q5 Q5 1 AL

Renamed the column 'Unnamed: 0' to 'indexes' for a more descriptive dataset.

In [61]:
# Determine the number of missing values

df_county_sespm25_index_quintile.isnull().sum()
Out[61]:
indexes              0
FIPS                 0
SES_index_1990       0
SES_index_2000       0
SES_index_2010       0
SES_quintile_1990    0
SES_quintile_2000    0
SES_quintile_2010    0
fip_state            0
state                0
dtype: int64
In [62]:
#  function to determine the percentage of missing values
# Typically less than five percent missing values may not affect the results
# More than 5% can be dropped, replaced with existing data, or imputed using mean or median.

def missing(Dataframe):
    print('Percentage of missing values in the dataset:\n',
          round((Dataframe.isnull().sum() *100/len(Dataframe)), 2).sort_values(ascending=False))
    

missing(df_county_sespm25_index_quintile)
Percentage of missing values in the dataset:
 indexes              0.0
FIPS                 0.0
SES_index_1990       0.0
SES_index_2000       0.0
SES_index_2010       0.0
SES_quintile_1990    0.0
SES_quintile_2000    0.0
SES_quintile_2010    0.0
fip_state            0.0
state                0.0
dtype: float64
In [63]:
#create a list of the columns in the dataset
df_heart_dx_mortCol = df_heart_dx_mort.columns 
df_heart_dx_mortCol
Out[63]:
Index(['YEAR', 'STATE', 'RATE', 'DEATHS', 'URL'], dtype='object')
In [65]:
# Update the Headers for Consistency

df_heart_dx_mortCol = df_heart_dx_mort.rename(columns = {'STATE':'state'})

# view the new columns and update the variable

df_heart_dx_mort = df_heart_dx_mortCol

df_heart_dx_mort.head()
Out[65]:
YEAR state RATE DEATHS URL
0 2022 AL 234.2 14958 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 145.7 1013 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 148.5 14593 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 224.1 8664 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 142.4 66340 /nchs/pressroom/states/california/ca.htm

Changed the column name 'STATE' to 'state' in this cardiovascular disease rate dataset to align with the column names in the other datasets for easier manipulation and merging if needed.

In [67]:
df_heart_dx_mort.head()
Out[67]:
YEAR state RATE DEATHS URL
0 2022 AL 234.2 14958 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 145.7 1013 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 148.5 14593 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 224.1 8664 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 142.4 66340 /nchs/pressroom/states/california/ca.htm
In [68]:
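# Work on the heart disease mortality dataset (df is an alias, not a copy)
df = df_heart_dx_mort

# Standardize the District of Columbia entry to its postal abbreviation
df['state'] = df['state'].replace({
    'District of Columbia': 'DC',
})

# Reassign the updated dataset
df_heart_dx_mort = df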


df_heart_dx_mort.head(5)
Out[68]:
YEAR state RATE DEATHS URL
0 2022 AL 234.2 14958 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 145.7 1013 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 148.5 14593 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 224.1 8664 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 142.4 66340 /nchs/pressroom/states/california/ca.htm

Changed the value 'District of Columbia' to 'DC' in the state column for consistency with the rest of the dataset.
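
After a replacement like this, it can help to confirm that every entry in the state column is a two-letter code. A minimal sketch on toy data; the frame and the variable name 'bad' are illustrative assumptions.

```python
import pandas as pd

# Toy state column containing one spelled-out name (illustrative only)
toy = pd.DataFrame({"state": ["AL", "AK", "District of Columbia", "WY"]})

# Same replacement as in the cell above
toy["state"] = toy["state"].replace({"District of Columbia": "DC"})

# Collect any value that is not exactly two uppercase letters
bad = toy.loc[~toy["state"].str.fullmatch(r"[A-Z]{2}"), "state"].tolist()
print(bad)
```

An empty list means no spelled-out names remain in the column.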

In [70]:
print(df['state'].unique())
['AL' 'AK' 'AZ' 'AR' 'CA' 'CO' 'CT' 'DE' 'DC' 'FL' 'GA' 'HI' 'ID' 'IL'
 'IN' 'IA' 'KS' 'KY' 'LA' 'ME' 'MD' 'MA' 'MI' 'MN' 'MS' 'MO' 'MT' 'NE'
 'NV' 'NH' 'NJ' 'NM' 'NY' 'NC' 'ND' 'OH' 'OK' 'OR' 'PA' 'RI' 'SC' 'SD'
 'TN' 'TX' 'UT' 'VT' 'VA' 'WA' 'WV' 'WI' 'WY']
In [71]:
# Determine the number of missing values

df_heart_dx_mort.isnull().sum()
Out[71]:
YEAR      0
state     0
RATE      0
DEATHS    0
URL       0
dtype: int64
In [72]:
def missing(Dataframe):
    print('Percentage of missing values in the dataset:\n',
          round((Dataframe.isnull().sum() *100/len(Dataframe)), 2).sort_values(ascending=False))
    

missing(df_heart_dx_mort)
Percentage of missing values in the dataset:
 YEAR      0.0
state     0.0
RATE      0.0
DEATHS    0.0
URL       0.0
dtype: float64

I have no missing values in this dataset, which is also good for my analysis because no rows need to be dropped or imputed before statistical analysis, exploration, and visualization.

In [74]:
#create a list of the columns in the dataset
df_htn_dx_mortCol = df_htn_dx_mort.columns 
df_htn_dx_mortCol
Out[74]:
Index(['YEAR', 'STATE', 'RATE', 'DEATHS', 'URL'], dtype='object')

Changed the column name 'STATE' to 'state' in this hypertensive disease rate dataset to align with the column names in the other datasets for easier manipulation and merging if needed.

In [76]:
# Update the Headers for Consistency

df_htn_dx_mortCol = df_htn_dx_mort.rename(columns = {'STATE':'state'})

# view the new columns and update the variable

df_htn_dx_mort = df_htn_dx_mortCol

df_htn_dx_mort.head()
Out[76]:
YEAR state RATE DEATHS URL
0 2022 AL 13.2 849 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 8.6 56 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 11.3 1109 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 12.1 454 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 14.4 6727 /nchs/pressroom/states/california/ca.htm
In [77]:
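# Work on the hypertension mortality dataset (df is an alias, not a copy)
df = df_htn_dx_mort

# Standardize the District of Columbia entry to its postal abbreviation
df['state'] = df['state'].replace({
    'District of Columbia': 'DC',
})

# Reassign the updated dataset
df_htn_dx_mort = df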


df_htn_dx_mort.head()
Out[77]:
YEAR state RATE DEATHS URL
0 2022 AL 13.2 849 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 8.6 56 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 11.3 1109 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 12.1 454 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 14.4 6727 /nchs/pressroom/states/california/ca.htm

Changed the value 'District of Columbia' to 'DC' in the state column for consistency with the rest of the dataset.

In [79]:
# number of missing values

df_htn_dx_mort.isnull().sum()
Out[79]:
YEAR      0
state     0
RATE      0
DEATHS    0
URL       0
dtype: int64
In [80]:
def missing(Dataframe):
    print('Percentage of missing values in the dataset:\n',
          round((Dataframe.isnull().sum() *100/len(Dataframe)), 2).sort_values(ascending=False))
    

missing(df_htn_dx_mort)
Percentage of missing values in the dataset:
 YEAR      0.0
state     0.0
RATE      0.0
DEATHS    0.0
URL       0.0
dtype: float64

I have no missing values in this dataset, which is also good for my analysis because no rows need to be dropped or imputed before statistical analysis, exploration, and visualization.

In [82]:
#create a list of the columns in the dataset
df_acs_2009_2010_statesCol = df_acs_2009_2010_states.columns 
df_acs_2009_2010_statesCol
Out[82]:
Index(['state', 'median_income', 'total_population_poverty', 'poverty_count',
       'total_population_uninsured', 'uninsured_count',
       'total_population_education_18', 'high_school_diploma',
       'ged_alternative', 'associates_degree', 'bachelors_degree',
       'masters_degree', 'professional_degree', 'doctorate_degree', 'fip',
       'poverty_rate', 'uninsured_rate', 'educated_adults',
       'education_percent_educated_18', 'year'],
      dtype='object')

The column names in this collated ACS rate dataset align with the research goals, so I will keep them as they are.

In [84]:
df_acs_2009_2010_states.head()
Out[84]:
state median_income total_population_poverty poverty_count total_population_uninsured uninsured_count total_population_education_18 high_school_diploma ged_alternative associates_degree bachelors_degree masters_degree professional_degree doctorate_degree fip poverty_rate uninsured_rate educated_adults education_percent_educated_18 year
0 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009
1 AK 66953 682412 61653 678081 24993 431178 4388 68535 15906 34369 13071 3876 2806 2 9.034571 3.685843 142951 33.153593 2009
2 AZ 48745 6475485 1069897 6501531 207853 4248231 46247 513087 150479 348081 135252 41173 29019 4 16.522268 3.196985 1263338 29.737978 2009
3 AR 37823 2806056 527378 2833391 44061 1903914 18213 324262 41334 114200 33797 13430 7963 5 18.794279 1.555062 553199 29.055882 2009
4 CA 58931 36202780 5128708 36376938 890998 23782109 308968 2474351 820990 2220258 830392 306369 210817 6 14.166614 2.449349 7172145 30.157733 2009
In [85]:
# number of missing values

df_acs_2009_2010_states.isnull().sum()
Out[85]:
state                            0
median_income                    0
total_population_poverty         0
poverty_count                    0
total_population_uninsured       0
uninsured_count                  0
total_population_education_18    0
high_school_diploma              0
ged_alternative                  0
associates_degree                0
bachelors_degree                 0
masters_degree                   0
professional_degree              0
doctorate_degree                 0
fip                              0
poverty_rate                     0
uninsured_rate                   0
educated_adults                  0
education_percent_educated_18    0
year                             0
dtype: int64
In [86]:
def missing(Dataframe):
    print('Percentage of missing values in the dataset:\n',
          round((Dataframe.isnull().sum() *100/len(Dataframe)), 2).sort_values(ascending=False))
    

missing(df_acs_2009_2010_states)
Percentage of missing values in the dataset:
 state                            0.0
median_income                    0.0
education_percent_educated_18    0.0
educated_adults                  0.0
uninsured_rate                   0.0
poverty_rate                     0.0
fip                              0.0
doctorate_degree                 0.0
professional_degree              0.0
masters_degree                   0.0
bachelors_degree                 0.0
associates_degree                0.0
ged_alternative                  0.0
high_school_diploma              0.0
total_population_education_18    0.0
uninsured_count                  0.0
total_population_uninsured       0.0
poverty_count                    0.0
total_population_poverty         0.0
year                             0.0
dtype: float64

I have no missing values in this collated ACS dataset, which is also good for my analysis because no rows need to be dropped or imputed before statistical analysis, exploration, and visualization.

In [88]:
df_annualstate_county_pm25_cmr.head() 
Out[88]:
indexes FIPS Year PM2.5 CMR fip_state state
19 20 1001 2009 6.402091 330.876172 1 AL
20 21 1001 2010 6.942778 316.911479 1 AL
40 41 1003 2009 5.419087 270.402216 1 AL
41 42 1003 2010 5.837704 276.377191 1 AL
61 62 1005 2009 5.840124 383.159080 1 AL

Exploratory Data Analysis and Feature Engineering¶

Descriptive Statistics¶

In [90]:
df_annualstate_county_pm25_cmr.describe()
Out[90]:
indexes FIPS Year PM2.5 CMR fip_state
count 4264.000000 4264.000000 4264.000000 4264.000000 4264.000000 4264.0000
mean 22396.000000 30599.787992 2009.500000 6.171229 257.605458 30.5000
std 12926.077525 15142.415588 0.500059 1.396911 56.675549 15.1239
min 20.000000 1001.000000 2009.000000 2.192728 106.135757 1.0000
25% 11208.000000 18162.500000 2009.000000 5.521922 216.515285 18.0000
50% 22396.000000 29164.000000 2009.500000 6.391946 250.385485 29.0000
75% 33584.000000 45019.500000 2010.000000 7.126114 291.266376 45.0000
max 44772.000000 56037.000000 2010.000000 9.384544 557.426037 56.0000

The minimum and maximum values for PM2.5 are 2.19 µg/m³ and 9.38 µg/m³, while the minimum and maximum values for the cardiovascular mortality rate are 106.1 per 100,000 and 557.4 per 100,000.

The mean PM2.5 of 6.17 and median of 6.39 suggest a roughly symmetric distribution with a slight left skew, since the mean falls below the median.

The mean CMR of 257.6 and median of 250.4 suggest a near-symmetric distribution as well, with a slight right skew.

The 25th percentiles are 5.52 and 216.5 for PM2.5 and CMR respectively. The 75th percentiles are 7.13 and 291.27 for PM2.5 and CMR respectively.

The standard deviation of PM2.5 at 1.40 indicates small variability across counties and states. However, the standard deviation of CMR at 56.7 shows a high spread in cardiovascular mortality rates across counties and states.
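
Comparing the mean with the median gives only a rough indication of symmetry; a sample skewness statistic makes the direction explicit. A minimal sketch using pandas' 'skew' method on illustrative values (not the study data):

```python
import pandas as pd

# Illustrative sample with one low outlier (not the study data)
sample = pd.Series([2.0, 5.5, 6.0, 6.4, 6.8, 7.1, 7.3])

summary = {
    "mean": sample.mean(),
    "median": sample.median(),
    # Negative skew means a longer left tail, positive a longer right tail
    "skew": sample.skew(),
}
print(summary)
```

Applied to the PM2.5 and CMR columns, the sign of the skew would show whether the small gaps between mean and median reflect real asymmetry.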

In [92]:
df_heart_dx_mort.describe()
Out[92]:
YEAR RATE
count 501.000000 501.000000
mean 2016.710579 172.287425
std 4.611515 32.655107
min 2005.000000 114.900000
25% 2015.000000 149.300000
50% 2018.000000 163.400000
75% 2020.000000 192.000000
max 2022.000000 306.400000

The minimum and maximum heart disease mortality rates in this dataframe are 114.9 and 306.4 per 100,000.

The mean of 172.3 and median of 163.4 suggest a right-skewed distribution.

The 25th percentile is 149.3 and the 75th percentile is 192.0.

The standard deviation of 32.7 is high and could point to significant differences in heart disease mortality rates across states and years in the USA.

In [94]:
df_htn_dx_mort.describe()
Out[94]:
YEAR RATE
count 501.000000 501.000000
mean 2016.710579 8.628343
std 4.611515 2.518634
min 2005.000000 0.000000
25% 2015.000000 6.900000
50% 2018.000000 8.300000
75% 2020.000000 10.100000
max 2022.000000 20.400000

The minimum and maximum values for this dataframe are 0.0 and 20.4 deaths per 100,000.

The mean of 8.63 and median of 8.30 suggest a mildly right-skewed distribution.

The 25th percentile is 6.9 and the 75th percentile is 10.1.

The standard deviation of 2.52 indicates moderate variability in hypertension mortality rates across states in the USA.

In [96]:
df_county_sespm25_index_quintile.describe()
Out[96]:
indexes FIPS SES_index_1990 SES_index_2000 SES_index_2010 fip_state
count 2132.000000 2132.000000 2.132000e+03 2.132000e+03 2.132000e+03 2132.000000
mean 1066.500000 30599.787992 -7.332054e-17 8.998431e-17 1.999651e-17 30.500000
std 615.599708 15144.191928 9.641826e-01 9.837311e-01 9.556947e-01 15.125674
min 1.000000 1001.000000 -2.535586e+00 -1.646289e+00 -1.836970e+00 1.000000
25% 533.750000 18162.500000 -6.293172e-01 -6.843596e-01 -6.735622e-01 18.000000
50% 1066.500000 29164.000000 -1.083418e-01 -2.034422e-01 -1.362228e-01 29.000000
75% 1599.250000 45019.500000 5.120400e-01 4.586209e-01 4.726322e-01 45.000000
max 2132.000000 56037.000000 5.645396e+00 6.646980e+00 6.456330e+00 56.000000

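The 'indexes' column is only a row counter (1 to 2132), so its mean and median of 1066.5 carry no analytical meaning. The SES indices themselves are standardized, with means near 0 and standard deviations near 1. Their medians (-0.11, -0.20, and -0.14 for 1990, 2000, and 2010) fall below the means, and the maxima (5.65 to 6.65) lie much further from zero than the minima (-2.54 to -1.65), which suggests right-skewed distributions.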

In [98]:
df_acs_2009_2010_states.describe()
Out[98]:
median_income total_population_poverty poverty_count total_population_uninsured uninsured_count total_population_education_18 high_school_diploma ged_alternative associates_degree bachelors_degree masters_degree professional_degree doctorate_degree fip poverty_rate uninsured_rate educated_adults education_percent_educated_18 year
count 104.000000 1.040000e+02 1.040000e+02 1.040000e+02 1.040000e+02 1.040000e+02 104.000000 1.040000e+02 104.000000 1.040000e+02 104.000000 104.000000 104.000000 104.000000 104.000000 104.000000 1.040000e+02 104.000000 104.000000
mean 49604.144231 5.847806e+06 8.895052e+05 5.898023e+06 1.189426e+05 3.954632e+06 36626.605769 5.468577e+05 128807.413462 3.352190e+05 128972.538462 45831.567308 29320.500000 29.788462 14.904238 1.825752 1.251635e+06 32.088187 2009.500000
std 9270.377961 6.565761e+06 1.035133e+06 6.609771e+06 1.921948e+05 4.374467e+06 51599.678880 5.364355e+05 146554.195832 3.906015e+05 153383.624955 55949.004524 35674.130151 16.692928 5.258778 0.942193 1.347906e+06 2.150077 0.502421
min 18314.000000 5.299820e+05 5.214400e+04 5.337160e+05 2.309000e+03 3.557930e+05 2014.000000 3.896300e+04 4215.000000 2.796800e+04 9389.000000 2694.000000 2056.000000 1.000000 8.286452 0.305054 1.193710e+05 25.197119 2009.000000
25% 43628.000000 1.689948e+06 2.264710e+05 1.710142e+06 2.477325e+04 1.111880e+06 7695.750000 1.616908e+05 40002.750000 8.855700e+04 29652.250000 12377.250000 7564.500000 16.750000 11.822853 1.154294 3.617645e+05 30.592055 2009.000000
50% 48258.000000 4.056070e+06 6.204850e+05 4.082100e+06 6.684400e+04 2.746110e+06 23654.000000 3.791155e+05 87378.000000 2.021790e+05 69815.500000 26596.000000 15831.500000 29.500000 14.240783 1.542725 8.376770e+05 32.441705 2009.500000
75% 55437.250000 6.489280e+06 9.851172e+05 6.512686e+06 1.220512e+05 4.466785e+06 38665.750000 6.576772e+05 153692.250000 4.507875e+05 169878.250000 59284.250000 39900.500000 42.500000 17.081788 2.421341 1.490284e+06 33.640857 2010.000000
max 69272.000000 3.659337e+07 5.783043e+06 3.681557e+07 1.119685e+06 2.409720e+07 324410.000000 2.474351e+06 822526.000000 2.220258e+06 849249.000000 306369.000000 219994.000000 72.000000 45.032912 4.650732 7.191509e+06 36.254092 2010.000000

The dataset shows considerable variability across several socioeconomic indicators. Median income ranges from 18,314 to 69,272, reflecting significant economic disparities. The number of individuals without health insurance also varies widely, from 2,309 to 1,119,685. Since raw counts depend heavily on population size, the uninsured rate discussed below is the better basis for comparing healthcare access.

Educational attainment, measured as the percentage of the population aged 18 and over counted as educated ('education_percent_educated_18'), spans from 25.2% to 36.25%. The mean of 32.1% is close to the median of 32.44%, suggesting a roughly symmetric distribution with minimal skewness.

In contrast, poverty rates exhibit a broader spread, ranging from 8.3% to 45.03%. The mean poverty rate is 14.9%, while the median is slightly lower at 14.24%, indicating a right-skewed distribution in which a small number of states experience much higher poverty levels. The interquartile range (IQR) is 5.26 percentage points (17.08 minus 11.82), and the standard deviation of 5.26 is roughly 35% of the mean, so dispersion is notable even within the central 50% of the data.

Uninsured rates are lower overall but still display meaningful variation, from 0.31% to 4.65%. The mean is 1.83% and the median is 1.54%, again suggesting a mild right skew. The variance is small in absolute terms (about 0.89).

Overall, the patterns suggest that while some indicators such as educational attainment are fairly consistent across states, others, particularly poverty and income, reveal significant inequality. The skewed distributions and wide IQRs in these domains may require further investigation into the structural and regional factors behind these disparities.
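
The spread measures for the poverty rate can be recomputed directly from the describe() output above. This is a worked sketch using the reported quartiles, mean, and standard deviation; the variable names are illustrative.

```python
# Values copied from the poverty_rate column of the describe() output
q1, q3 = 11.822853, 17.081788
mean, std = 14.904238, 5.258778

iqr = q3 - q1        # spread of the middle 50% of observations
cv = std / mean      # coefficient of variation (unitless)
variance = std ** 2  # variance in squared percentage points

print(round(iqr, 2), round(cv, 2), round(variance, 2))
```

The coefficient of variation is a convenient way to compare dispersion between variables measured on different scales, such as the poverty rate and the uninsured rate.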

In [100]:
# Merge the SES index/quintile data with the PM2.5/CMR data
# Cast 'FIPS' to str in the SES data so both merge keys share a dtype
df_county_ses_quintile_index = df_county_sespm25_index_quintile
df_county_ses_quintile_index['FIPS'] = df_county_ses_quintile_index['FIPS'].astype(str)

# Cast 'FIPS' to str in the PM2.5/CMR data as well
df_pm25_cmr = df_annualstate_county_pm25_cmr
df_pm25_cmr['FIPS'] = df_pm25_cmr['FIPS'].astype(str)

# Merge on 'FIPS'
df_merged_state_county = pd.merge(df_pm25_cmr, df_county_ses_quintile_index, on='FIPS', how='inner')


# View merged DataFrame
df_merged_state_county.head()
Out[100]:
indexes_x FIPS Year PM2.5 CMR fip_state_x state_x indexes_y SES_index_1990 SES_index_2000 SES_index_2010 SES_quintile_1990 SES_quintile_2000 SES_quintile_2010 fip_state_y state_y
0 20 1001 2009 6.402091 330.876172 1 AL 1 -0.079387 -0.322846 -0.405150 Q3 Q3 Q2 1 AL
1 21 1001 2010 6.942778 316.911479 1 AL 1 -0.079387 -0.322846 -0.405150 Q3 Q3 Q2 1 AL
2 41 1003 2009 5.419087 270.402216 1 AL 2 -0.187240 -0.467794 -0.403987 Q3 Q2 Q2 1 AL
3 42 1003 2010 5.837704 276.377191 1 AL 2 -0.187240 -0.467794 -0.403987 Q3 Q2 Q2 1 AL
4 62 1005 2009 5.840124 383.159080 1 AL 3 1.279538 2.013751 1.740142 Q5 Q5 Q5 1 AL
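
A merge like this can be made more defensive by normalizing the key and asking pandas to verify the expected key relationship. The sketch below uses toy frames with illustrative values; zero-padding to five characters assumes the integer FIPS codes lost a leading zero on import, which is an assumption rather than something the notebook states.

```python
import pandas as pd

# Toy stand-ins for the two tables (illustrative values only)
pm = pd.DataFrame({
    "FIPS": [1001, 1001, 1003],
    "Year": [2009, 2010, 2009],
    "CMR": [330.9, 316.9, 270.4],
})
ses = pd.DataFrame({"FIPS": [1001, 1003], "SES_index_2010": [-0.41, -0.40]})

# Assumption: county FIPS codes are five characters with a leading zero
for frame in (pm, ses):
    frame["FIPS"] = frame["FIPS"].astype(str).str.zfill(5)

# validate raises MergeError if the SES table has duplicate FIPS keys
merged = pd.merge(pm, ses, on="FIPS", how="inner", validate="many_to_one")
print(merged)
```

With 'validate' set, an unexpected duplicate county in the SES table fails loudly instead of silently multiplying rows.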
In [101]:
# Feature Engineering
# Drop only existing columns
df_merged_state_county = df_merged_state_county.drop(columns=['fip_state_y', 'state_y','indexes_y','indexes_x'])

# Rename columns
df_merged_state_county = df_merged_state_county.rename(columns={ 'fip_state_x': 'fip','state_x': 'state'})

# View merged DataFrame
df_merged_state_county.head(20)
Out[101]:
FIPS Year PM2.5 CMR fip state SES_index_1990 SES_index_2000 SES_index_2010 SES_quintile_1990 SES_quintile_2000 SES_quintile_2010
0 1001 2009 6.402091 330.876172 1 AL -0.079387 -0.322846 -0.405150 Q3 Q3 Q2
1 1001 2010 6.942778 316.911479 1 AL -0.079387 -0.322846 -0.405150 Q3 Q3 Q2
2 1003 2009 5.419087 270.402216 1 AL -0.187240 -0.467794 -0.403987 Q3 Q2 Q2
3 1003 2010 5.837704 276.377191 1 AL -0.187240 -0.467794 -0.403987 Q3 Q2 Q2
4 1005 2009 5.840124 383.159080 1 AL 1.279538 2.013751 1.740142 Q5 Q5 Q5
5 1005 2010 6.339941 387.051896 1 AL 1.279538 2.013751 1.740142 Q5 Q5 Q5
6 1009 2009 7.091090 285.100812 1 AL 0.124421 -0.375181 -0.405849 Q4 Q3 Q2
7 1009 2010 7.897200 279.421128 1 AL 0.124421 -0.375181 -0.405849 Q4 Q3 Q2
8 1011 2009 6.548729 310.851335 1 AL 2.877256 3.519681 2.617074 Q5 Q5 Q5
9 1011 2010 7.171266 362.096030 1 AL 2.877256 3.519681 2.617074 Q5 Q5 Q5
10 1013 2009 5.553551 283.798082 1 AL 1.922153 1.858747 1.680438 Q5 Q5 Q5
11 1013 2010 6.013731 394.257094 1 AL 1.922153 1.858747 1.680438 Q5 Q5 Q5
12 1015 2009 6.582951 355.071369 1 AL 0.103711 0.448460 0.913785 Q4 Q4 Q5
13 1015 2010 7.406110 354.016025 1 AL 0.103711 0.448460 0.913785 Q4 Q4 Q5
14 1017 2009 6.183137 360.897531 1 AL 0.660426 0.829457 1.443492 Q4 Q5 Q5
15 1017 2010 6.865899 366.882019 1 AL 0.660426 0.829457 1.443492 Q4 Q5 Q5
16 1021 2009 6.037810 344.930926 1 AL 0.492201 0.316738 0.340982 Q4 Q4 Q4
17 1021 2010 6.720577 308.845625 1 AL 0.492201 0.316738 0.340982 Q4 Q4 Q4
18 1023 2009 5.263957 376.460282 1 AL 1.802146 1.774375 0.742904 Q5 Q5 Q5
19 1023 2010 5.834130 355.032353 1 AL 1.802146 1.774375 0.742904 Q5 Q5 Q5
In [155]:
#Utilizing plotly

import plotly.express as px

df_state_2010 = df_merged_state_county[df_merged_state_county['Year'] == 2010].groupby('state', as_index=False).agg({
    'PM2.5': 'mean',
    'CMR': 'mean',
    'SES_index_2010': 'mean'  
})

# --- Choropleth Map ---
fig_pm25 = px.choropleth(
    df_state_2010,
    locations='state',           
    locationmode="USA-states",   
    color='PM2.5',              
    scope="usa",               
    color_continuous_scale="Viridis",
    title="Average PM2.5 Levels in 2010",
    hover_data=['state', 'PM2.5']  
)

fig_cmr = px.choropleth(
    df_state_2010,
    locations='state',
    locationmode="USA-states",
    color='CMR',
    scope="usa",
    color_continuous_scale="OrRd",
    title="Average Cardiovascular Mortality Rates in 2010",
    hover_data=['state', 'CMR']
)

fig_ses = px.choropleth(
    df_state_2010,
    locations='state',
    locationmode="USA-states",
    color='SES_index_2010',
    scope="usa",
    color_continuous_scale="Plasma",
    title="Average Socioeconomic Status Index in 2010",
    hover_data=['state', 'SES_index_2010']
)

fig_pm25.show()
fig_cmr.show()
fig_ses.show()
In [157]:
# Feature Engineering 

df1 = df_acs_2009_2010_states

df2 = df_annualstate_county_pm25_cmr

# The second dataset stores the state-level FIPS code in 'fip_state'; rename it to 'fip' to match the ACS data


df2.rename(columns={'fip_state': 'fip'}, inplace=True)

df_acs_pm25_cmr_ses_index_state_combined = pd.merge(df1, df2, how='inner', on='fip')

df_acs_pm25_cmr_ses_index_state_combined.rename(columns={'state_x': 'state'}, inplace=True)

df_acs_pm25_cmr_ses_index_state_combined.drop(columns=['state_y'], inplace=True)

df_acs_pm25_cmr_ses_index_state_combined.head(5)
Out[157]:
state median_income total_population_poverty poverty_count total_population_uninsured uninsured_count total_population_education_18 high_school_diploma ged_alternative associates_degree bachelors_degree masters_degree professional_degree doctorate_degree fip poverty_rate uninsured_rate educated_adults education_percent_educated_18 year indexes FIPS Year PM2.5 CMR
0 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009 20 1001 2009 6.402091 330.876172
1 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009 21 1001 2010 6.942778 316.911479
2 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009 41 1003 2009 5.419087 270.402216
3 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009 42 1003 2010 5.837704 276.377191
4 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 2009 62 1005 2009 5.840124 383.159080
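
In the output above, row 1 pairs the ACS 'year' 2009 with the county 'Year' 2010, because the merge key is the state FIPS code alone and each county-year row therefore matches both ACS years. One possible alternative is to key the merge on both the state code and the year. A minimal sketch on toy frames (illustrative values, not the study data):

```python
import pandas as pd

# Toy ACS table: one state, two years (illustrative values only)
acs = pd.DataFrame({
    "fip": [1, 1],
    "year": [2009, 2010],
    "poverty_rate": [17.5, 19.0],
})
# Toy county table: one county in that state, two years
county = pd.DataFrame({
    "fip": [1, 1],
    "FIPS": ["1001", "1001"],
    "Year": [2009, 2010],
    "CMR": [330.9, 316.9],
})

# Keyed on state only: every county-year pairs with every ACS year
by_state_only = pd.merge(acs, county, on="fip")

# Keyed on state and year: each county-year pairs with its own ACS year
by_state_year = pd.merge(
    acs, county, left_on=["fip", "year"], right_on=["fip", "Year"]
)
print(len(by_state_only), len(by_state_year))
```

The state-only merge doubles the row count in this toy case, which would also duplicate observations in any correlation computed on the combined table.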
In [161]:
#Utilizing plotly

import plotly.express as px

df_state_2010 = df_acs_pm25_cmr_ses_index_state_combined[df_acs_pm25_cmr_ses_index_state_combined['Year'] == 2010].groupby('state', as_index=False).agg({
    'PM2.5': 'mean',
    'CMR': 'mean',
    'poverty_rate': 'mean'  
})

# --- Choropleth Map ---
fig_pm25 = px.choropleth(
    df_state_2010,
    locations='state',           
    locationmode="USA-states",   
    color='PM2.5',              
    scope="usa",               
    color_continuous_scale="Viridis",
    title="Average PM2.5 Levels in 2010",
    hover_data=['state', 'PM2.5']  
)

fig_cmr = px.choropleth(
    df_state_2010,
    locations='state',
    locationmode="USA-states",
    color='CMR',
    scope="usa",
    color_continuous_scale="OrRd",
    title="Average Cardiovascular Mortality Rates in 2010",
    hover_data=['state', 'CMR']
)

fig_poverty = px.choropleth(
    df_state_2010,
    locations='state',
    locationmode="USA-states",
    color='poverty_rate',
    scope="usa",
    color_continuous_scale="Plasma",
    title="Average Poverty Rate in 2010",
    hover_data=['state', 'poverty_rate']
)

fig_pm25.show()
fig_cmr.show()
fig_poverty.show()
In [163]:
# Feature engineering 
# Merge on 'state' and 'YEAR' for alignment
df_cvd_htn_mort_combined = pd.merge(df_heart_dx_mort, df_htn_dx_mort, on=['state', 'YEAR'])


# View merged DataFrame
#df_cvd_htn_mort_combined.head()

df_cvd_htn_mort_combined_reup = df_cvd_htn_mort_combined.rename(columns={'RATE_x': 'Cvdmortrate', 'DEATHS_x': 'Cvddeathcount', 'URL_x': 'URL_cvdmort', 'RATE_y': 'Htndxdeathrate','DEATHS_y': 'Htndxdeathcount', 'URL_y': 'URL_htnmort'})

df_cvd_htn_mort_combined_reup.head()

# Save as csv if needed
#df_cvd_htn_mort_combined_reup.to_csv('cvd_htn_mort_rate_combined_data.csv', index=False)
Out[163]:
YEAR state Cvdmortrate Cvddeathcount URL_cvdmort Htndxdeathrate Htndxdeathcount URL_htnmort
0 2022 AL 234.2 14958 /nchs/pressroom/states/alabama/al.htm 13.2 849 /nchs/pressroom/states/alabama/al.htm
1 2022 AK 145.7 1013 /nchs/pressroom/states/alaska/ak.htm 8.6 56 /nchs/pressroom/states/alaska/ak.htm
2 2022 AZ 148.5 14593 /nchs/pressroom/states/arizona/az.htm 11.3 1109 /nchs/pressroom/states/arizona/az.htm
3 2022 AR 224.1 8664 /nchs/pressroom/states/arkansas/ar.htm 12.1 454 /nchs/pressroom/states/arkansas/ar.htm
4 2022 CA 142.4 66340 /nchs/pressroom/states/california/ca.htm 14.4 6727 /nchs/pressroom/states/california/ca.htm
In [164]:
df_acs_pm25_cmr_ses_index_state_combined.drop(columns=['year'], inplace=True)
df_acs_pm25_cmr_ses_index_state_combined.head(5)
Out[164]:
state median_income total_population_poverty poverty_count total_population_uninsured uninsured_count total_population_education_18 high_school_diploma ged_alternative associates_degree bachelors_degree masters_degree professional_degree doctorate_degree fip poverty_rate uninsured_rate educated_adults education_percent_educated_18 indexes FIPS Year PM2.5 CMR
0 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 20 1001 2009 6.402091 330.876172
1 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 21 1001 2010 6.942778 316.911479
2 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 41 1003 2009 5.419087 270.402216
3 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 42 1003 2010 5.837704 276.377191
4 AL 40489 4588899 804683 4616028 66730 3115982 27958 464551 88341 211422 68352 26346 18412 1 17.535426 1.445615 905382 29.056073 62 1005 2009 5.840124 383.159080

Correlation Analysis¶

Plots and Correlation map: These visualizations illustrate the relationships between socioeconomic factor variables, PM2.5 and CMR.¶
In [166]:
df_annualcounty_pm25_cmrCorr = df_annualcounty_pm25_cmr.corr(numeric_only=True)
#df_annualcounty_pm25_cmrCorr #view output
In [167]:
# Set seaborn themes
sns.set_theme(style='white')
sns.color_palette('viridis', as_cmap=True)
Out[167]:
[viridis colormap preview]
In [168]:
# Create the plot
plt.figure(figsize=(10,6))
matrix = df_annualcounty_pm25_cmrCorr
mask = np.triu(np.ones_like(matrix, dtype=float))
sns.heatmap(df_annualcounty_pm25_cmrCorr,
           annot=True,
           linewidths=.5,
           cmap='viridis',
           fmt= '.2f',
           mask=mask)

# Specify the name of the plot
plt.title('Correlation Between Features')
plt.show()
[Figure: heatmap titled 'Correlation Between Features']

There is a weak-to-moderate positive correlation between PM2.5 levels and the cardiovascular mortality rate (CMR), with a correlation coefficient (r) of 0.41. This suggests that higher levels of fine particulate matter (PM2.5) are modestly associated with increased cardiovascular mortality. Additionally, there is a moderately strong negative correlation between year and CMR (r = -0.63), indicating a possible declining trend in cardiovascular mortality over time.
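
A correlation coefficient is easier to interpret alongside its p-value and the share of variance it accounts for; for r = 0.41, r squared is about 0.17. The sketch below shows the calculation with scipy.stats.pearsonr on synthetic data; the generated values and the slope used to create them are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data with a built-in positive relationship (not the study data)
rng = np.random.default_rng(0)
pm25 = rng.normal(6.2, 1.4, size=500)
cmr = 200 + 10 * pm25 + rng.normal(0, 40, size=500)

# Pearson correlation and its two-sided p-value
r, p_value = pearsonr(pm25, cmr)

# Share of variance in one variable linearly accounted for by the other
r_squared = r ** 2
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}, p = {p_value:.3g}")
```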

In [170]:
df_merged_state_countyCorr = df_merged_state_county.corr(numeric_only=True)
#df_merged_state_countyCorr #view output
In [171]:
# Set seaborn themes
sns.set_theme(style='white')
sns.color_palette('viridis', as_cmap=True)
Out[171]:
[viridis colormap preview]
In [172]:
# Create the plot
plt.figure(figsize=(10,6))
matrix = df_merged_state_countyCorr
mask = np.triu(np.ones_like(matrix, dtype=float))
sns.heatmap(df_merged_state_countyCorr,
           annot=True,
           linewidths=.5,
           cmap='viridis',
           fmt= '.2f',
           mask=mask)

# Specify the name of the plot
plt.title('Correlation Between Features')
plt.show()
[Figure: heatmap titled 'Correlation Between Features']

Findings¶

It is worth noting that the correlation values in this heat map suggest that the cardiovascular mortality rate increases as the socioeconomic status index increases. This contrasts with research suggesting that higher socioeconomic status is associated with lower CMR because of better health habits and healthcare access. Possible explanations include confounding by region or other variables, or SES indices capturing additional complexities, such as counties with much older populations.
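
One way to probe the regional-confounding explanation is to compare the pooled correlation with the correlation that remains after each region's mean is subtracted. The sketch below uses constructed, hypothetical data (three regions named A, B and C; the column names `ses_index` and `cmr` are assumptions) in which the two correlations have opposite signs. It illustrates the mechanism only and says nothing about what the study data would show.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three regions, each with 20 counties.
# Within every region CMR falls as the SES index rises, but regions with a
# higher average SES index also have a higher baseline CMR.
within = np.linspace(-1.0, 1.0, 20)
baselines = {"A": (0.0, 200.0), "B": (5.0, 230.0), "C": (10.0, 260.0)}
frames = []
for region, (ses_base, cmr_base) in baselines.items():
    frames.append(pd.DataFrame({
        "region": region,
        "ses_index": ses_base + within,
        "cmr": cmr_base - 4.0 * within,
    }))
demo = pd.concat(frames, ignore_index=True)

# Pooled correlation ignores region entirely
pooled_r = demo["ses_index"].corr(demo["cmr"])

# Within-region correlation: subtract each region's mean before correlating
region_means = demo.groupby("region")[["ses_index", "cmr"]].transform("mean")
centered = demo[["ses_index", "cmr"]] - region_means
within_r = centered["ses_index"].corr(centered["cmr"])

print(f"pooled r = {pooled_r:.2f}, within-region r = {within_r:.2f}")
```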

In [174]:
# Hypothesis test
from scipy.stats import ttest_ind

# Hypothesis "States/Counties with higher PM2.5 levels have higher CMR"
high_pm25 = df_merged_state_county[df_merged_state_county['PM2.5'] > df_merged_state_county['PM2.5'].median()]['CMR']
low_pm25 = df_merged_state_county[df_merged_state_county['PM2.5'] <= df_merged_state_county['PM2.5'].median()]['CMR']

# Perform t-test
t_stat, p_value = ttest_ind(high_pm25, low_pm25)
print(f"t-statistic: {t_stat}, p-value: {p_value}")
t-statistic: 7.1673660317328, p-value: 8.96756932006852e-13

Findings.¶

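There is a statistically significant difference in CMR between states/counties with high PM2.5 levels (above the median) and those with low PM2.5 levels (t ≈ 7.17, p ≈ 9.0e-13). The positive t-statistic indicates that mean CMR is higher in the high-PM2.5 group.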
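
The test above uses SciPy's default pooled-variance t-test. As a hedged alternative sketch on synthetic data, Welch's version (`equal_var=False`) does not assume equal variances, a one-sided alternative matches the directional hypothesis, and Cohen's d gives a sense of effect size alongside the p-value. The arrays `high_demo` and `low_demo` are hypothetical stand-ins for the two CMR groups.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic CMR values for a high- and a low-PM2.5 group (hypothetical numbers)
rng = np.random.default_rng(42)
high_demo = rng.normal(260.0, 45.0, size=300)
low_demo = rng.normal(240.0, 35.0, size=300)

# Welch's t-test with a one-sided alternative:
# H1 is that the high-PM2.5 group has the larger mean CMR
t_welch, p_one_sided = ttest_ind(
    high_demo, low_demo, equal_var=False, alternative="greater"
)

# Cohen's d, scaled by the root mean of the two sample variances
pooled_sd = np.sqrt((high_demo.var(ddof=1) + low_demo.var(ddof=1)) / 2)
cohens_d = (high_demo.mean() - low_demo.mean()) / pooled_sd

print(f"Welch t = {t_welch:.2f}, one-sided p = {p_one_sided:.3g}, d = {cohens_d:.2f}")
```

The `alternative` argument requires SciPy 1.6 or later.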

Associations between socioeconomic factors (poverty, education, and health insurance) and cardiovascular mortality rates across some U.S. states¶
In [177]:
#Correlation Analysis
df_acs_pm25_cmr_ses_index_state_combinedCorr = df_acs_pm25_cmr_ses_index_state_combined[['poverty_rate', 'uninsured_rate','education_percent_educated_18', 'PM2.5','CMR']].corr()
#df_acs_pm25_cmr_ses_index_state_combinedCorr #view output
In [178]:
# Set seaborn themes
sns.set_theme(style='white')
sns.color_palette('viridis', as_cmap=True)
In [179]:
# Create the plot
plt.figure(figsize=(10,5))
matrix = df_acs_pm25_cmr_ses_index_state_combinedCorr
mask = np.triu(np.ones_like(matrix, dtype=float))
sns.heatmap(df_acs_pm25_cmr_ses_index_state_combinedCorr,
           annot=True,
           linewidths=.5,
           cmap='viridis',
           fmt= '.2f',
           mask=mask)

# Specify the name of the plot
plt.title('Correlation Between Features')
plt.show()
Boxplots on SES and CMR: They reveal systematic differences in CMR across socioeconomic groups.¶

Findings.¶

The correlation map suggests modest positive correlations of the poverty rate and PM2.5 with the cardiovascular mortality rate, with correlation coefficients (r) of 0.48 and 0.22 respectively, indicating that higher poverty levels and higher PM2.5 levels may be associated with higher CMR.

In [182]:
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_merged_state_county, x='SES_quintile_2010', y='CMR', palette='coolwarm')
plt.title('Distribution of Cardiovascular Mortality Across SES Quintiles (2010)', fontsize=14)
plt.xlabel('SES Quintile')
plt.ylabel('Cardiovascular Mortality Rate')
plt.grid(True)
plt.tight_layout()
plt.show()

Findings.¶

The boxplots of CMR by SES quintile and by poverty tertile show contrasting patterns. This suggests that while socioeconomic status and poverty are related, their associations with cardiovascular health may be distinct. Socioeconomic status likely captures broader factors, such as access to quality education, stable employment, and social support networks, whereas poverty levels focus more narrowly on income deprivation. Further statistical analysis is needed to determine the significance of the observed differences.
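
One possible form of that further analysis is a rank-based test across groups. The sketch below, on synthetic data, forms poverty tertiles with `pd.qcut` (the same kind of split this notebook uses for its boxplot) and applies a Kruskal-Wallis H-test. The data frame `demo` and its values are hypothetical, and the choice of Kruskal-Wallis over one-way ANOVA is an assumption made because the regression residuals reported later appear skewed.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Synthetic rows standing in for poverty_rate and CMR (hypothetical values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({"poverty_rate": rng.uniform(8, 22, size=300)})
demo["CMR"] = 120 + 9 * demo["poverty_rate"] + rng.normal(0, 30, size=300)

# Split into poverty tertiles
demo["tertile"] = pd.qcut(demo["poverty_rate"], q=3, labels=["1st", "2nd", "3rd"])

# Kruskal-Wallis H-test: do the CMR distributions differ across the tertiles?
groups = [g["CMR"].to_numpy() for _, g in demo.groupby("tertile", observed=True)]
h_stat, p_kw = kruskal(*groups)

print(f"H = {h_stat:.1f}, p = {p_kw:.3g}")
```

A significant result only says that at least one group differs; pairwise follow-up tests would be needed to say which.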

In [184]:
# Categorize state-level observations into poverty-rate tertiles (stored as SES_Group)
df_acs_pm25_cmr_ses_index_state_combined['SES_Group'] = pd.qcut(df_acs_pm25_cmr_ses_index_state_combined['poverty_rate'], q=3, labels=['1st poverty tertile', '2nd poverty tertile', '3rd poverty tertile'])

# Boxplot
plt.figure(figsize=(8,6))
sns.boxplot(data=df_acs_pm25_cmr_ses_index_state_combined, x="SES_Group", y="CMR", palette="viridis")
plt.title("Cardiovascular Mortality Rate by Socioeconomic Status based on poverty rate classification")
plt.xlabel("Socioeconomic Status Group")
plt.ylabel("Cardiovascular Mortality Rate (Standardized)")
plt.grid(True)
plt.show()
In [185]:
sns.pairplot(
    df_acs_pm25_cmr_ses_index_state_combined,
    x_vars=["PM2.5", "poverty_rate", "uninsured_rate","education_percent_educated_18"],
    y_vars=["CMR"]
)



plt.show()

The pairplot above provides a row of scatter plots examining how PM2.5 and socioeconomic factors (poverty, education, insurance) relate to the cardiovascular mortality rate (CMR).¶

Findings.¶

The pairwise relationships show that higher PM2.5 levels may be associated with increased CMR, and that lower education and higher poverty rates may also be associated with increased CMR.

In [187]:
sns.pairplot(
    df_acs_pm25_cmr_ses_index_state_combined,
    x_vars=["PM2.5", "poverty_rate", "uninsured_rate","education_percent_educated_18"],
    y_vars=["CMR"],
    hue="SES_Group",  
    palette="viridis" ,
    height=4,  
    aspect=1.5 
)

# Add a legend
plt.legend(title="Cardiovascular mortality rate compared to socioeconomic factors", bbox_to_anchor=(1.05, 1), loc='upper left')

# Adjust layout
plt.tight_layout()

# Show the plot
plt.show()

Scatter plots above: CMR in relation to PM2.5 and Socioeconomic Indicators¶

These plots suggest that both environmental and social determinants are associated with CMR. They point to a differential relationship between PM2.5 and cardiovascular mortality across socioeconomic status groups, potentially indicating increased vulnerability in lower-SES communities, which may experience elevated CMR even at lower pollution levels. They also show socioeconomic differences: higher poverty and uninsured rates, coupled with lower education levels (probably indicative of lower SES), are associated with increased CMR. Stratifying by SES group allows a preliminary exploration of the intersectional nature of these factors by showing how the relationship between one socioeconomic indicator and CMR may vary across SES levels.
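
A visual impression of differing slopes can be made concrete by fitting a separate line within each group. The sketch below does this with `np.polyfit` on synthetic data in which the slopes are deliberately built to differ. The group labels mirror the tertile names used in this notebook, while the data frame `demo` and all numeric values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the PM2.5 slope is built to be steeper in the poorest tertile
rng = np.random.default_rng(7)
true_slopes = {
    "1st poverty tertile": 4.0,
    "2nd poverty tertile": 7.0,
    "3rd poverty tertile": 12.0,
}
rows = []
for group, slope in true_slopes.items():
    pm25 = rng.uniform(5, 14, size=200)
    cmr = 180 + slope * pm25 + rng.normal(0, 10, size=200)
    rows.append(pd.DataFrame({"SES_Group": group, "PM2.5": pm25, "CMR": cmr}))
demo = pd.concat(rows, ignore_index=True)

# Fit CMR = intercept + slope * PM2.5 separately within each group
fitted_slopes = {}
for group, sub in demo.groupby("SES_Group"):
    slope, intercept = np.polyfit(sub["PM2.5"], sub["CMR"], deg=1)
    fitted_slopes[group] = float(slope)

for group, slope in fitted_slopes.items():
    print(f"{group}: {slope:.2f} CMR units per unit of PM2.5")
```

Comparing the fitted slopes side by side, ideally with confidence intervals, is a more direct check of differential impact than reading it off colored scatter points.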

In [189]:
# Average CMR at each distinct PM2.5 value
averageVariable1 = df_acs_pm25_cmr_ses_index_state_combined.groupby('PM2.5')['CMR'].mean()

# Keep the 50 PM2.5 values with the highest average CMR.
# Note: selecting on the outcome narrows the CMR range and can distort the fitted slope.
maxVariable1 = averageVariable1.sort_values(ascending=False).head(50)

# Scatter plot with a fitted regression line (regplot already draws the points)
plt.figure(figsize=(12, 6))
sns.regplot(x=maxVariable1.index, y=maxVariable1.values, scatter=True, line_kws={'color': 'red'})
plt.xlabel('PM2.5')
plt.ylabel('CMR')
plt.title('Average CMR for the 50 PM2.5 Values with the Highest Average CMR')
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.tight_layout()
# Show the visualization
plt.show()

The plots above and below explore the relationship between air pollution (PM2.5) and cardiovascular mortality.¶

Findings.¶

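Higher PM2.5 levels appear to be linked to an increase in CMR when variables are standardized (the regression plot below), reinforcing environmental concerns in cardiovascular health. In contrast, higher PM2.5 levels appear to be linked to a decrease in CMR when variables are averaged (the plot above). That plot keeps only the 50 PM2.5 values with the highest average CMR, so it covers a narrow, outcome-selected slice of the data and its slope should be interpreted with caution.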
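
The disagreement between the two plots may partly reflect how the averaged plot is built: it retains only rows chosen for having the highest CMR. The sketch below, on synthetic data with a genuinely positive relationship, compares the slope fitted on all rows with the slope fitted on the 50 rows with the largest outcome. It is a simplified illustration (it skips the group-and-average step), and the data frame `demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a genuinely positive PM2.5-CMR relationship
rng = np.random.default_rng(3)
demo = pd.DataFrame({"PM2.5": rng.normal(9.0, 2.0, size=2000)})
demo["CMR"] = 200 + 8 * demo["PM2.5"] + rng.normal(0, 30, size=2000)

# Slope fitted on every row
full_slope = float(np.polyfit(demo["PM2.5"], demo["CMR"], deg=1)[0])

# Slope fitted only on the 50 rows with the highest CMR (selection on the outcome)
top50 = demo.nlargest(50, "CMR")
top_slope = float(np.polyfit(top50["PM2.5"], top50["CMR"], deg=1)[0])

print(f"slope on all rows: {full_slope:.2f}")
print(f"slope on top-50-by-CMR rows: {top_slope:.2f}")
```

Because the retained rows all have similar, high CMR values, the fitted line in the selected subset is typically much flatter than the full-data line and can even change sign.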

In [191]:
# Scatter Plot: PM2.5 vs Cardiovascular Mortality Rate
plt.figure(figsize=(8,6))
sns.regplot(data=df_acs_pm25_cmr_ses_index_state_combined, x="PM2.5", y="CMR", scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title("PM2.5 vs. Cardiovascular Mortality Rate")
plt.xlabel("PM2.5 (Standardized)")
plt.ylabel("Cardiovascular Mortality Rate (Standardized)")
plt.grid(True)
plt.show()
In [192]:
# States of interest
states = ['DC', 'MD', 'VA', 'WV', 'PA', 'DE', 'MN', 'NY', 'NJ', 'TX', 'OH']
df_filtered = df_acs_pm25_cmr_ses_index_state_combined[df_acs_pm25_cmr_ses_index_state_combined['state'].isin(states)]

# Group by state and calculate the mean of CMR, PM2.5 and the socioeconomic indicators
df_grouped = df_filtered.groupby('state')[['CMR', 'PM2.5','poverty_rate','uninsured_rate','education_percent_educated_18']].mean().reset_index()

# Reshape the grouped dataframe to long format for plotting
df_melted = df_grouped.melt(
    id_vars=['state'], 
    value_vars=['CMR', 'PM2.5','poverty_rate','uninsured_rate','education_percent_educated_18'],  
    var_name='Metric', 
    value_name='Value'
)

# Create a side-by-side bar plot
plt.figure(figsize=(12, 8))
sns.barplot(
    x='state',
    y='Value',
    hue='Metric',  
    data=df_melted,
    palette='viridis'
)
plt.title('Average Cardiovascular Mortality Rate (CMR) per 100,000, PM2.5, Poverty Rate, Uninsured Rate and Educated Adults Percentage by State', fontsize=16)
plt.xlabel('State', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.xticks(rotation=90)
plt.legend(title='Metric')
plt.show()

The scatter plot below examines the correlation between the poverty rate and cardiovascular mortality rates.¶

Findings.¶

The plot shows a positive correlation with both standardized and non-standardized variables, indicating that states with higher poverty rates tend to have higher cardiovascular mortality rates.

In [194]:
# Average CMR at each distinct poverty rate value
averageVariable1 = df_acs_pm25_cmr_ses_index_state_combined.groupby('poverty_rate')['CMR'].mean()

# Keep the 50 poverty rate values with the highest average CMR
maxVariable1 = averageVariable1.sort_values(ascending=False).head(50)

# Create a scatter plot
plt.figure(figsize=(12, 6))
sns.regplot(x=maxVariable1.index, y=maxVariable1.values, scatter=True, line_kws={'color': 'red'})
plt.xlabel('poverty_rate')
plt.ylabel('CMR')
plt.title(' Poverty rate with Highest Average Percentage of CMR')
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.tight_layout()
# Show the visualization
plt.show()

The scatter plot below examines the correlation between higher education rates and cardiovascular mortality rates.¶

Findings.¶

The plot shows a potential negative correlation with both standardized and non-standardized variables, indicating that states with higher rates of educated residents tend to have lower cardiovascular mortality rates.

In [196]:
# Average CMR at each distinct education rate value
averageVariable1 = df_acs_pm25_cmr_ses_index_state_combined.groupby('education_percent_educated_18')['CMR'].mean()

# Keep the 50 education rate values with the highest average CMR
maxVariable1 = averageVariable1.sort_values(ascending=False).head(50)

# Create a scatter plot
plt.figure(figsize=(12, 6))
sns.regplot(x=maxVariable1.index, y=maxVariable1.values, scatter=True, line_kws={'color': 'red'})
plt.xlabel('education_percent_educated_18')
plt.ylabel('CMR')
plt.title('Higher Educated Person rate with Highest Average Percentage of CMR')
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.tight_layout()
# Show the visualization
plt.show()

The scatter plot below examines the correlation between the uninsured rate and cardiovascular mortality rates.¶

Findings.¶

The plot shows a potential positive correlation, indicating that states with higher uninsured rates tend to have higher cardiovascular mortality rates.

In [198]:
# Average CMR at each distinct uninsured rate value
averageVariable1 = df_acs_pm25_cmr_ses_index_state_combined.groupby('uninsured_rate')['CMR'].mean()

# Keep the 50 uninsured rate values with the highest average CMR
maxVariable1 = averageVariable1.sort_values(ascending=False).head(50)

# Create a scatter plot
plt.figure(figsize=(12, 6))
sns.regplot(x=maxVariable1.index, y=maxVariable1.values, scatter=True, line_kws={'color': 'red'})
plt.xlabel('uninsured_rate')
plt.ylabel('CMR')
plt.title('Health Uninsurance Rate with Highest Average Percentage of CMR')
plt.xticks(rotation=90)
plt.grid(axis='y')
plt.tight_layout()
# Show the visualization
plt.show()
In [199]:
# Plotly scatter chart
import plotly.express as px

fig = px.scatter(
    df_acs_pm25_cmr_ses_index_state_combined,
    x='poverty_rate',
    y='CMR',
    color='education_percent_educated_18',
    title='The Interaction Between CVD Mortality and Socioeconomic Status Factors (Poverty Rate, Higher Education)',
    labels={
        "poverty_rate": "Poverty rate",
        "CMR": "Cardiovascular Mortality Rates",
        "education_percent_educated_18": "Rates of Population with Higher Education"
    },
    color_continuous_scale=px.colors.sequential.Viridis
)
fig.show()

Findings.¶

The visualization above is a scatter plot of poverty rate against CMR, colored by the rate of higher education. The relationships suggest that higher education and lower poverty rates may be associated with decreased CMR.

In [201]:
# Plotly scatter chart
import plotly.express as px

fig = px.scatter(
    df_acs_pm25_cmr_ses_index_state_combined,
    x='PM2.5',
    y='CMR',
    color='uninsured_rate',
    title='The Interaction Between CVD Mortality, PM2.5 and a Socioeconomic Status Factor (Uninsured Rate)',
    labels={
        "PM2.5": "Particulate Matter 2.5 levels",
        "CMR": "Cardiovascular Mortality Rates",
        "uninsured_rate": "Rates of Population Lacking Health Insurance"
    },
    color_continuous_scale=px.colors.sequential.Viridis
)
fig.show()

Findings.¶

The visualization above examines how a socioeconomic factor (health insurance coverage) and PM2.5 relate to cardiovascular mortality. The relationships subtly suggest that as PM2.5 levels rise in combination with higher uninsured rates, cardiovascular mortality may also rise.

How does hypertension prevalence influence cardiovascular mortality rates?¶

In [206]:
df_cvd_htn_mort_combined_reup_clean=df_cvd_htn_mort_combined_reup.drop(columns=['URL_cvdmort', 'URL_htnmort'])

df_cvd_htn_mort_combined_reup_clean.tail()
Out[206]:
     YEAR  state  Cvdmortrate  Cvddeathcount  Htndxdeathrate  Htndxdeathcount
496  2005     VA        203.0          14192             7.9              549
497  2005     WA        180.5          10985             7.5              452
498  2005     WV        253.6           5538            11.6              253
499  2005     WI        190.6          11842             7.1              451
500  2005     WY        188.3            952             3.9               20
In [207]:
#Correlation Analysis
df_cvd_htn_mort_combined_reup_cleanCorr = df_cvd_htn_mort_combined_reup_clean[['Cvdmortrate', 'Htndxdeathrate']].corr()
df_cvd_htn_mort_combined_reup_cleanCorr

from scipy.stats import pearsonr
#Pearson correlation and p-value
corr_coef, p_value = pearsonr(df_cvd_htn_mort_combined_reup_clean['Cvdmortrate'], df_cvd_htn_mort_combined_reup_clean['Htndxdeathrate'])

corr_coef,p_value
Out[207]:
(0.2952786039775407, 1.5444236806717292e-11)

Findings.¶

This suggests a weak positive relationship between the hypertension mortality rate and the cardiovascular disease mortality rate (r ≈ 0.30) that is statistically significant (p ≈ 1.5e-11).

In [209]:
# Scatter Plot: Hypertension Mortality Rate vs Cardiovascular Mortality Rate
plt.figure(figsize=(8,6))
sns.regplot(data=df_cvd_htn_mort_combined_reup_clean, x="Htndxdeathrate", y="Cvdmortrate", scatter_kws={'alpha':0.5}, line_kws={'color':'red'})
plt.title("Hypertension Mortality vs Cardiovascular Mortality Rate")
plt.xlabel("Hypertension Mortality Rate (Standardized)")
plt.ylabel("Cardiovascular Mortality Rate (Standardized)")
plt.grid(True)
plt.show()
In [210]:
# States of interest
states = ['DC', 'MD', 'VA', 'WV', 'PA', 'DE', 'MN', 'NY', 'NJ', 'TX', 'OH']
df_filtered = df_cvd_htn_mort_combined_reup[df_cvd_htn_mort_combined_reup['state'].isin(states)]

# Group by state and calculate mean CMR and Hypertension Rate
df_grouped = df_filtered.groupby('state')[['Cvdmortrate', 'Htndxdeathrate']].mean().reset_index()

# Join the grouped dataframe for plotting
df_melted = df_grouped.melt(
    id_vars=['state'], 
    value_vars=['Cvdmortrate', 'Htndxdeathrate'],  
    var_name='Metric', 
    value_name='Value'
)

# Create a side-by-side bar plot
plt.figure(figsize=(12, 8))
sns.barplot(
    x='state',
    y='Value',
    hue='Metric',  
    data=df_melted,
    palette='viridis'
)
plt.title('Average Cardiovascular Mortality Rate (CMR) and Hypertension Rate by State', fontsize=16)
plt.xlabel('State', fontsize=14)
plt.ylabel('Value', fontsize=14)
plt.xticks(rotation=90)
plt.legend(title='Metric')
plt.show()

The patterns in the figure above are expected for cardiovascular disease mortality and hypertension mortality rates: although hypertension is a major risk factor for cardiovascular death, cardiovascular disease and mortality can result from a large number of other conditions.

Regression Analysis: The regression results quantify how various factors contribute to CMR.

In [213]:
# independent and dependent variables
X = df_cvd_htn_mort_combined_reup_clean[['Htndxdeathrate']]  
y = df_cvd_htn_mort_combined_reup_clean['Cvdmortrate']  

import statsmodels.api as sm
# intercept
X = sm.add_constant(X)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Display summary statistics
model.summary()
Out[213]:
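                            OLS Regression Results
==============================================================================
Dep. Variable:            Cvdmortrate   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                  0.085
Method:                 Least Squares   F-statistic:                     47.66
Date:                Sat, 19 Apr 2025   Prob (F-statistic):           1.54e-11
Time:                        02:36:09   Log-Likelihood:                -2434.0
No. Observations:                 501   AIC:                             4872.
Df Residuals:                     499   BIC:                             4880.
Df Model:                           1
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const            139.2546      4.984     27.940      0.000     129.462     149.047
Htndxdeathrate     3.8284      0.555      6.904      0.000       2.739       4.918
==============================================================================
Omnibus:                       30.390   Durbin-Watson:                   1.994
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               34.272
Skew:                           0.629   Prob(JB):                     3.61e-08
Kurtosis:                       3.248   Cond. No.                         32.5
==============================================================================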


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Findings.¶

The visualizations show that although the hypertension mortality rate may have some influence on the cardiovascular mortality rate, the association is weak. This is expected, as CMR can be linked to a variety of factors, some of which are inter-related with hypertension. Furthermore, our regression model shows a statistically significant relationship between the hypertension-related death rate and the cardiovascular mortality rate. The positive and significant coefficient (3.83, p < 0.001) suggests that higher hypertension-related death rates are associated with higher cardiovascular mortality rates, which gives some insight into the influence of hypertension on cardiovascular mortality. However, the low R-squared value (0.087) indicates that hypertension-related death rates explain only a small portion (8.7%) of the variation in cardiovascular mortality rates. Other conditions or factors, such as PM2.5 levels and socioeconomic factors, are therefore likely to be important and should be included in the analysis.
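
The correlation and regression outputs above can be cross-checked against each other. In an ordinary least squares model with a single predictor and an intercept, R-squared equals the square of Pearson's r, and the F-statistic equals the square of the slope's t-statistic. The short check below uses only values copied from the outputs above; the variable names are illustrative.

```python
# Values copied from the outputs above
r = 0.2952786039775407        # Pearson r between Cvdmortrate and Htndxdeathrate
t_slope = 6.904               # t-statistic of the Htndxdeathrate coefficient
r_squared_reported = 0.087    # R-squared from the OLS summary
f_reported = 47.66            # F-statistic from the OLS summary

# In a one-predictor OLS model, R-squared is the square of Pearson's r ...
r_squared_from_r = r ** 2

# ... and the F-statistic is the square of the slope's t-statistic
f_from_t = t_slope ** 2

print(f"r^2 = {r_squared_from_r:.4f} (reported R-squared: {r_squared_reported})")
print(f"t^2 = {f_from_t:.3f} (reported F: {f_reported})")
```

Any small gap between the computed and reported values comes from rounding in the printed summary.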

In [215]:
# Dependent variable and predictors
dependent_var = 'CMR'
predictors = ['PM2.5']

# Add an intercept to the independent variables
X = sm.add_constant(df_acs_pm25_cmr_ses_index_state_combined[predictors])
y = df_acs_pm25_cmr_ses_index_state_combined[dependent_var]

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CMR   R-squared:                       0.047
Model:                            OLS   Adj. R-squared:                  0.047
Method:                 Least Squares   F-statistic:                     422.0
Date:                Sat, 19 Apr 2025   Prob (F-statistic):           1.47e-91
Time:                        02:36:09   Log-Likelihood:                -46324.
No. Observations:                8528   AIC:                         9.265e+04
Df Residuals:                    8526   BIC:                         9.267e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        203.2342      2.714     74.888      0.000     197.914     208.554
PM2.5          8.8104      0.429     20.541      0.000       7.970       9.651
==============================================================================
Omnibus:                      690.091   Durbin-Watson:                   0.832
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              908.627
Skew:                           0.704   Prob(JB):                    4.95e-198
Kurtosis:                       3.759   Cond. No.                         29.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [216]:
# Dependent variable and predictors
dependent_var = 'CMR'
predictors = ['uninsured_rate', 'PM2.5']

# Add an intercept to the independent variables
X = sm.add_constant(df_acs_pm25_cmr_ses_index_state_combined[predictors])
y = df_acs_pm25_cmr_ses_index_state_combined[dependent_var]

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CMR   R-squared:                       0.054
Model:                            OLS   Adj. R-squared:                  0.054
Method:                 Least Squares   F-statistic:                     242.2
Date:                Sat, 19 Apr 2025   Prob (F-statistic):          4.74e-103
Time:                        02:36:09   Log-Likelihood:                -46294.
No. Observations:                8528   AIC:                         9.259e+04
Df Residuals:                    8525   BIC:                         9.262e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const            189.3197      3.250     58.255      0.000     182.949     195.690
uninsured_rate     5.1492      0.667      7.722      0.000       3.842       6.456
PM2.5              9.4450      0.435     21.699      0.000       8.592      10.298
==============================================================================
Omnibus:                      703.300   Durbin-Watson:                   0.843
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              937.813
Skew:                           0.707   Prob(JB):                    2.27e-204
Kurtosis:                       3.801   Cond. No.                         36.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [217]:
# Dependent variable and predictors
dependent_var = 'CMR'
predictors = ['poverty_rate', 'PM2.5']

# Add an intercept to the independent variables
X = sm.add_constant(df_acs_pm25_cmr_ses_index_state_combined[predictors])
y = df_acs_pm25_cmr_ses_index_state_combined[dependent_var]

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CMR   R-squared:                       0.262
Model:                            OLS   Adj. R-squared:                  0.262
Method:                 Least Squares   F-statistic:                     1511.
Date:                Sat, 19 Apr 2025   Prob (F-statistic):               0.00
Time:                        02:36:09   Log-Likelihood:                -45236.
No. Observations:                8528   AIC:                         9.048e+04
Df Residuals:                    8525   BIC:                         9.050e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           69.7983      3.591     19.439      0.000      62.760      76.837
poverty_rate     9.3889      0.189     49.780      0.000       9.019       9.759
PM2.5            7.1237      0.379     18.792      0.000       6.381       7.867
==============================================================================
Omnibus:                      487.993   Durbin-Watson:                   1.059
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              645.338
Skew:                           0.540   Prob(JB):                    7.36e-141
Kurtosis:                       3.807   Cond. No.                         114.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [218]:
# Dependent variable and predictors
dependent_var = 'CMR'
predictors = ['poverty_rate', 'uninsured_rate', 'PM2.5']

# Add an intercept to the independent variables
X = sm.add_constant(df_acs_pm25_cmr_ses_index_state_combined[predictors])
y = df_acs_pm25_cmr_ses_index_state_combined[dependent_var]

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CMR   R-squared:                       0.270
Model:                            OLS   Adj. R-squared:                  0.270
Method:                 Least Squares   F-statistic:                     1051.
Date:                Sat, 19 Apr 2025   Prob (F-statistic):               0.00
Time:                        02:36:09   Log-Likelihood:                -45188.
No. Observations:                8528   AIC:                         9.038e+04
Df Residuals:                    8524   BIC:                         9.041e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const             76.3900      3.633     21.026      0.000      69.268      83.512
poverty_rate      10.0973      0.201     50.250      0.000       9.703      10.491
uninsured_rate    -6.1647      0.627     -9.824      0.000      -7.395      -4.935
PM2.5              6.2367      0.388     16.089      0.000       5.477       6.997
==============================================================================
Omnibus:                      510.461   Durbin-Watson:                   1.065
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              669.638
Skew:                           0.561   Prob(JB):                    3.89e-146
Kurtosis:                       3.790   Cond. No.                         117.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [219]:
# Dependent variable and predictors
dependent_var = 'CMR'
predictors = ['poverty_rate', 'education_percent_educated_18', 'uninsured_rate', 'PM2.5']

# Add an intercept to the independent variables
X = sm.add_constant(df_acs_pm25_cmr_ses_index_state_combined[predictors])
y = df_acs_pm25_cmr_ses_index_state_combined[dependent_var]

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Print the model summary
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    CMR   R-squared:                       0.303
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     926.8
Date:                Sat, 19 Apr 2025   Prob (F-statistic):               0.00
Time:                        02:36:09   Log-Likelihood:                -44990.
No. Observations:                8528   AIC:                         8.999e+04
Df Residuals:                    8523   BIC:                         9.003e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                           512.5388     21.964     23.336      0.000     469.485     555.593
poverty_rate                      4.8573      0.326     14.894      0.000       4.218       5.497
education_percent_educated_18   -10.8550      0.539    -20.122      0.000     -11.912      -9.798
uninsured_rate                  -12.5539      0.690    -18.182      0.000     -13.907     -11.200
PM2.5                             6.1614      0.379     16.267      0.000       5.419       6.904
==============================================================================
Omnibus:                      395.441   Durbin-Watson:                   1.111
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              508.545
Skew:                           0.475   Prob(JB):                    3.72e-111
Kurtosis:                       3.727   Cond. No.                     1.53e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.53e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Findings.¶

PM2.5 shows a significant positive association with the cardiovascular mortality rate in every model, with coefficients between about 6.2 and 9.4. Socioeconomic factors are also important predictors, with poverty rate and education level appearing to have a substantial impact. Explanatory power varies across models: PM2.5 alone explains 4.7% of the variance, adding poverty rate raises R-squared to 0.262, and the model with all four predictors has the highest R-squared (0.303), with statistically significant coefficients for every predictor. In that model, higher poverty rates and PM2.5 levels are associated with higher cardiovascular mortality, while higher education levels are associated with lower cardiovascular mortality. The uninsured-rate coefficient changes sign, from positive when paired only with PM2.5 to negative once poverty rate is included, so it should be interpreted with caution. The final model also flags possible multicollinearity (condition number 1.53e+03), probably due to inter-relationships between the predictors.
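
A large condition number can reflect predictor scaling as well as genuine collinearity, so a per-predictor diagnostic is useful. The sketch below computes variance inflation factors (VIFs) as the diagonal of the inverse correlation matrix, using NumPy only. The helper name `variance_inflation_factors` and the synthetic predictors are assumptions for illustration; the first two columns are deliberately built to be strongly related.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X, from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Hypothetical predictors: education is built to track poverty closely
rng = np.random.default_rng(5)
poverty = rng.normal(15, 4, size=1000)
education = 95 - 0.9 * poverty + rng.normal(0, 1.0, size=1000)
pm25 = rng.normal(9, 2, size=1000)

X_demo = np.column_stack([poverty, education, pm25])
vifs = variance_inflation_factors(X_demo)

for name, vif in zip(["poverty", "education", "pm25"], vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign that a coefficient is poorly determined; values near 1 indicate little overlap with the other predictors.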

Findings.¶

The visualizations in this paper reflect the relationship between cardiovascular mortality rates (CMR), air pollution (PM2.5), and socioeconomic factors such as poverty, education, and healthcare access (measured by the uninsured rate). The correlation map and regression analysis confirm that both environmental and social determinants contribute significantly to variations in CMR across U.S. states. Higher PM2.5 exposure is associated with increased cardiovascular mortality, reinforcing concerns about air pollution's impact on heart disease. Lower socioeconomic status (SES) groups experience higher CMR, highlighting the role of poverty, education disparities, and possibly other factors in cardiovascular health.

Summary of Key Findings¶

This analysis reveals a critical interplay between environmental pollution, socioeconomic status (SES), and cardiovascular disease mortality outcomes. Notably, PM2.5 exposure emerged as a statistically significant predictor of cardiovascular mortality; however, its influence was disproportionately severe in communities with lower SES, indicating that socioeconomic vulnerabilities amplify the detrimental effects of pollution on health. Furthermore, SES itself acts as a crucial risk multiplier: lower-income communities, characterized by higher uninsured rates and lower educational attainment, experience elevated cardiovascular and hypertension mortality. This aligns with the concept of multifactorial disadvantage, in which the accumulation of multiple vulnerabilities worsens health outcomes. While initial observations suggested a limited direct influence of overall hypertension mortality rates on cardiovascular mortality, regression analysis identified a statistically significant positive association between hypertension-related death rates and cardiovascular mortality rates, albeit one explaining only a small portion of the variance. This supports the role of hypertension as a contributing factor, as documented in the clinical literature, while also highlighting the likely significant roles of other conditions and socioeconomic determinants. Ultimately, the findings point to a synergistic effect in which the combination of pollution and low socioeconomic status is associated with higher cardiovascular mortality rates than either factor in isolation, exposing a compounding public health and social issue.
In addition, while the visualizations suggest a weak direct influence of overall hypertension mortality rates on cardiovascular mortality rates (an expected finding given the multifactorial nature of CMR, which can be linked to various factors that are sometimes interrelated with hypertension), our regression model revealed a statistically significant positive relationship between hypertension-related death rates and cardiovascular mortality rates. The positive and significant coefficient indicates that higher hypertension-related death rates are associated with higher cardiovascular mortality rates, offering some insight into the influence of hypertension prevalence on cardiovascular mortality. However, the low R-squared value (0.087) indicates that hypertension-related death rates alone explain only a limited portion (8.7%) of the variation in cardiovascular mortality rates. This implies that other conditions or factors, such as PM2.5 levels and broader socioeconomic determinants, likely exert substantial influence and warrant further investigation. It is important to note that these are ecological correlations. While they can suggest potential relationships at the population level, they do not establish individual-level causation. Further individual-level studies would be needed to confirm these associations and understand the underlying mechanisms.
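
The amplification described in this summary (a stronger PM2.5 association where SES is lower) corresponds, in regression terms, to an interaction between PM2.5 and an SES measure. The sketch below shows one way such a term could be estimated. It is an illustration under stated assumptions, not the model fitted in this paper: the data are hypothetical and generated from a known formula, poverty rate stands in for SES, and the small least-squares helper is written from scratch so the example has no external dependencies.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy data (NOT the paper's data), generated so that the PM2.5
# slope grows with the poverty rate: cmr = 200 + 2*pm + 1*pov + 0.3*pm*pov.
pm25 = [6, 8, 10, 12, 6, 8, 10, 12]
poverty = [10, 10, 10, 10, 25, 25, 25, 25]
cmr = [200 + 2 * p + 1 * s + 0.3 * p * s for p, s in zip(pm25, poverty)]

# Design matrix: intercept, PM2.5, poverty, and their product (the interaction).
design = [[1.0, p, s, p * s] for p, s in zip(pm25, poverty)]
intercept, b_pm25, b_poverty, b_interaction = ols(design, cmr)

# With an interaction, the PM2.5 slope depends on the poverty level.
for level in (10, 25):
    print(f"PM2.5 slope at poverty {level}%: {b_pm25 + b_interaction * level:.2f}")
```

In real data, a positive interaction coefficient whose confidence interval excludes zero would be the kind of evidence that supports an amplification claim; a model with only main effects cannot distinguish amplification from two independent additive associations.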

Recommendations¶

The need for immediate and transformative action to achieve socio-environmental justice is clear: the burden of pollution must no longer fall disproportionately on vulnerable communities. It is our hope that public health departments, environmental regulators, and local governments will use these findings to prioritize the most vulnerable communities for intervention and improve lives. To this end, a fundamental shift in policy and practice is required, beginning with a decisive four-year phased strategy. In the initial phase (Years 1 and 2), air quality regulations must be re-evaluated and strengthened with priority given to the most vulnerable. "High burden" states and counties, identified through a confluence of high pollution levels and significant socioeconomic vulnerability, should serve as pilot sites for enhanced emissions controls, robust and targeted environmental monitoring, and the enforcement of stricter policies for individuals, entrepreneurs, businesses, and industries. This targeted approach requires moving beyond uniform PM2.5 thresholds to reflect the amplified risks faced by communities within the lowest SES quintiles. Concurrently, addressing the immediate health disparities requires a dedicated and phased investment in expanding healthcare access. In the initial phase, mobile health clinics should be strategically deployed in these pilot "high burden" zones, alongside steps to expand Medicaid eligibility. Building on the lessons learned, subsequent years should scale the successful outreach models to other rural and low-income areas with high pollution and cardiovascular mortality rates. This expansion should be supported by the direct allocation of increasing public health resources to these underserved regions. A proactive and systemic approach also demands a phased investment in the long-term resilience of these communities through education and workforce development.
Starting in Year 1 within the pilot counties, data on SES and pollution exposure should be used to strategically channel initial education grants and adult learning initiatives. As the strategy progresses into Years 3 and 4, these efforts should be scaled, with particular emphasis on job creation programs in environmental remediation and the growing clean energy sector, empowering residents to participate in the transition toward a healthier environment. To prevent the perpetuation of environmental injustice, building and industrial permitting must also change, with reforms implemented system-wide over the four-year period. Commencing immediately, comprehensive Green Health Equity Impact Assessments or Social Health Equity Impact Assessments must be mandated for all new and renewed permits, using established environmental justice screening tools to ensure that potential hazards are not disproportionately sited in vulnerable zones and that new structures and renovations are environmentally suitable and promote clean air. This proactive approach aims to alleviate the systemic burdens on affected communities. Initiating these programs in the pilot counties in Year 1 requires the immediate allocation of $100 million in federal and state block grants, with matching funds actively sought from the Environmental Protection Agency (EPA) and the Centers for Disease Control and Prevention (CDC). Oversight of the implementation should be entrusted to joint EPA and Health and Human Services (HHS) regional task forces composed of city and county officials, state officials, environmental scientists, public health experts, physicians, data scientists, data analysts, urban planners, and health policy analysts with a dedicated focus on equity.
Parallel legislative action must be pursued at the state level throughout this four-year period to empower effective enforcement of strengthened environmental regulations, ensuring accountability and long-term sustainability. Furthermore, all implemented programs, from the pilot phase onwards, must embed meaningful community engagement, ensure transparency in decision-making processes, and incorporate strict and ongoing impact evaluation to track progress and ensure accountability in the pursuit of socio-environmental justice for all communities, regardless of socioeconomic status. This research is intended to inform action and improve lives. The findings should be used by public health departments, environmental regulators, and local governments to prioritize the most vulnerable communities for intervention.
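
The recommendations above describe "high burden" areas as those combining high pollution with significant socioeconomic vulnerability, and refer to SES quintiles. One possible way to turn that description into a screening rule is sketched below. Everything specific here is an assumption for illustration: the county records are hypothetical, poverty rate is used as the SES proxy, and the rule (top quintile on both measures) is only one of several reasonable definitions.

```python
from statistics import quantiles

# Hypothetical county records: (name, PM2.5 in µg/m³, poverty rate in %).
counties = [
    ("A", 6.1, 9.0), ("B", 7.4, 12.5), ("C", 8.0, 14.0), ("D", 8.8, 11.0),
    ("E", 9.5, 21.0), ("F", 10.2, 18.5), ("G", 11.0, 24.0), ("H", 11.9, 16.0),
    ("I", 12.6, 27.5), ("J", 13.3, 30.0),
]


def top_quintile_cutoff(values):
    """Return the 80th percentile, i.e. the lower bound of the top quintile."""
    return quantiles(values, n=5)[-1]


pm_cut = top_quintile_cutoff([pm for _, pm, _ in counties])
pov_cut = top_quintile_cutoff([pov for _, _, pov in counties])

# Assumed rule: "high burden" means top quintile on BOTH measures.
high_burden = [
    name for name, pm, pov in counties
    if pm >= pm_cut and pov >= pov_cut
]
print(high_burden)
```

A stricter or looser rule (for example, top quintile on either measure, or a composite index combining several SES indicators) would select a different set of pilot sites, so the definition should be fixed before the data are examined.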

Conclusion¶

This paper examined the compounded influence of socioeconomic status (SES) factors and PM2.5 exposure on cardiovascular disease (CVD) mortality rates, revealing not only statistically significant associations but also a clear pattern of systemic neglect that has allowed environmental and social vulnerabilities to converge with devastating health consequences. The findings point to a critical interplay in which lower SES amplifies the detrimental impact of PM2.5, leading to elevated CVD mortality rates in the U.S. Addressing this socio-environmental issue requires a fundamental shift in policy and practice, commencing with a decisive four-year phased strategy encompassing reforms across multiple domains. This includes a re-evaluation of air quality standards to reflect the principle of differential vulnerability, prioritizing enhanced emissions controls and stricter permitting in the most burdened, lowest SES communities. Simultaneously, expanding healthcare access through targeted outreach such as mobile clinics and broadened Medicaid eligibility is crucial to mitigate adverse health outcomes. Furthermore, investing in education and workforce development within these communities, particularly in green sectors, offers a pathway toward long-term resilience. Finally, an overhaul of building and industrial permitting, mandating comprehensive Green or Social Health Equity Impact Assessments, is essential to prevent further socio-environmental decline and promote healthier environments. The proposed implementation, unfolding over four years, emphasizes practical, phased, and measurable steps that center community engagement, transparency, and strict impact evaluation, moving beyond mere regulatory compliance. This paper, therefore, provides more than insight; it offers a roadmap for change.
The converging influence of environmental exposure and inadequate social protection on cardiovascular mortality rate represents a policy failure, not an unavoidable reality. The weight of the evidence compels a shift in our approach: from passively monitoring harm to actively preventing it, and from merely studying inequality to dismantling the systemic barriers that perpetuate it, ultimately striving for socio-environmental and health justice for all.

References¶

Cox Jr., L. A. (2018). Socioeconomic and particulate air pollution correlates of heart disease risk. Environmental Research, 166, 409–416. https://doi.org/10.1016/j.envres.2018.07.023.

Crouse, D. L., Peters, P. A., van Donkelaar, A., Goldberg, M. S., Villeneuve, P. J., Brion, O., Khan, S., et al. (2012). Risk of nonaccidental and cardiovascular mortality in relation to long-term exposure to low concentrations of fine particulate matter: A Canadian national-level cohort study. Environmental Health Perspectives, 120(5), 708–714. https://doi.org/10.1289/ehp.1104049.

Di, Q., Wang, Y., Zanobetti, A., Wang, Y., Koutrakis, P., Choirat, C., Dominici, F., & Schwartz, J. D. (2017). Air pollution and mortality in the Medicare population. New England Journal of Medicine, 376(26), 2513–2522. https://doi.org/10.1056/NEJMoa1702747.

Krittanawong, C., Qadeer, Y. K., Hayes, R. B., Wang, Z., Thurston, G. D., Virani, S., & Lavie, C. J. (2023). PM2.5 and cardiovascular diseases: State-of-the-Art review. International Journal of Cardiology and Cardiovascular Risk Prevention, 20, 200217. https://doi.org/10.1016/j.ijcrp.2023.200217.

Ma, Y., Zang, E., Opara, I., Lu, Y., Krumholz, H. M., & Chen, K. (2023). Racial/ethnic disparities in PM2.5-attributable cardiovascular mortality burden in the United States. Nature Human Behaviour, 7, 2074–2083.

Phelan, J. C., Link, B. G., & Tehranifar, P. (2010). Social conditions as fundamental causes of health inequalities: Theory, evidence, and implications. Journal of Health and Social Behavior, 51(1_suppl), S28–S40.